
How to Recover Control Center from Hardware Failure

Applies To

  • Zenoss 5.x

Summary

Hardware failure in Control Center can take various forms, including:

  • Running out of disk space on one or more of the partitions that store Control Center, Docker or Zenoss data.
  • Power failure on a Control Center host.

In either case, data might not have been written to disk, leaving your system in an unusable state.

Symptoms

The symptoms of low disk space or power failure include system instability, data loss, and error entries in log files.

Procedures

The following sections describe some possible hardware failure results and their associated remediation steps.

How to Check Disk Space

The standard du and df commands do not provide useful capacity information for logical volumes managed by the Logical Volume Manager (LVM), because LVM aggregates physical volumes into volume groups and then carves those groups into logical volumes. Use the following LVM-specific commands instead.

Identify Volume Groups

To determine which volume groups exist, issue the following command:

# vgs
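
For example, the output might resemble the following (the volume group names and sizes here are hypothetical; actual names vary by installation):

  VG       #PV #LV #SN Attr   VSize   VFree
  centos     1   3   0 wz--n- <99.00g  4.00m
  serviced   1   2   0 wz--n-  50.00g 10.00g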

Display Volume Group Information

To determine the volume group total space, how much free space exists within the volume group and how much space is allocated to logical volumes, issue the following command:

# vgdisplay [vg_name]

Identify Physical Volumes

To determine which physical volumes make up a volume group, and which logical volumes exist within it, add the -v option:

# vgdisplay -v [vg_name]

Display Logical Volume Sizes

To display the size of logical volumes:

# lvs [-a] [--units hHbBsSkKmMgGtTpPeE]

Note: The lvs command can output in units you choose. From the lvs manpage:

--units hHbBsSkKmMgGtTpPeE
       All sizes are output in these units: (h)uman-readable, (b)ytes,
       (s)ectors, (k)ilobytes, (m)egabytes, (g)igabytes, (t)erabytes,
       (p)etabytes, (e)xabytes. Capitalise to use multiples of 1000 (S.I.)
       instead of 1024. Can also specify custom units e.g. --units 3M
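
For example, to display logical volume sizes in gigabytes for a single volume group (the volume group name here is hypothetical):

# lvs --units g serviced

Omit the volume group name to list the logical volumes in every volume group.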

For additional information about the Logical Volume Manager, see the Red Hat Enterprise Linux Logical Volume Manager Administration guide, located at: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/index.html.

Recovery Steps For Various Scenarios

The Docker Filesystem (/var/lib/docker) Out of Space

If /var/lib/docker has no available disk space, free space by deleting the existing containers (a disk-usage check follows the list), for example:

  1. Shut down serviced.
  2. Delete all existing containers:

    docker rm $(docker ps -qa)
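
Before and after deleting containers, you can check how much space is available; a minimal check, assuming /var/lib/docker is on its own partition:

    df -h /var/lib/docker

When serviced is started again, it re-creates the internal containers it needs.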

If some Docker metadata wasn’t fully written to disk, problems can manifest in various ways, for example:

  • The Docker daemon could refuse to start. This is often caused by the presence of one or more zero-length files in /var/lib/docker (specifically, /var/lib/docker/trust/official.json and /var/lib/docker/repositories-btrfs are known offenders). You can safely delete these files and restart Docker to recover; see the sketch after this list.
     
  • The serviced daemon could fail to start one or more internal services, logging API Error (500); the Docker logs will show a more specific error. This has so far been seen only on Docker versions earlier than 1.6.0, and occurs when Docker's internal graph database is corrupted. In one observed case, the Docker logs reported "Cannot find child for /serviced-isvcs_logstash". To correct this issue, perform the following:
    1. Shut down serviced (it might already have stopped on its own).
    2. Run docker ps -a to display all (stopped) containers remaining on the system. Several of them might have no status or name; these are the problem.
    3. Remove all containers:

      docker rm $(docker ps -aq)

    4. Start serviced
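
For the zero-length file case described above, a minimal recovery sketch, assuming Docker is managed by systemd (the find command only locates the offenders; the rm targets are the two known files named above):

    systemctl stop docker
    find /var/lib/docker -maxdepth 2 -type f -size 0
    rm -f /var/lib/docker/trust/official.json /var/lib/docker/repositories-btrfs
    systemctl start docker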

Corruption of Control Center Zookeeper (/opt/serviced/var/isvcs/zookeeper) Files

The most likely effect of Zookeeper corruption is that services will not start, or will not start correctly; they might also report bad networking imports, and virtual hosts might not work properly. This data can be rebuilt. Perform the following (a consolidated sketch follows the list):

  1. Shut down serviced
  2. Remove the directory:

    sudo rm -rf /opt/serviced/var/isvcs/zookeeper

  3. Delete all existing containers:

    docker rm $(docker ps -qa)

  4. Start serviced
  5. Restart serviced on all remote hosts
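
A consolidated sketch of this sequence on the master host, assuming serviced is managed by systemd (as on RHEL/CentOS 7):

    systemctl stop serviced
    sudo rm -rf /opt/serviced/var/isvcs/zookeeper
    docker rm $(docker ps -qa)
    systemctl start serviced

Then, on each remote host:

    systemctl restart serviced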

Corruption of Control Center HBase/OpenTSDB (/opt/serviced/var/isvcs/opentsdb) Files

If the internal HBase becomes corrupted, it might be able to recover on its own. It attempts this on startup and logs the attempt. On very large systems, however, the default heap settings might be inadequate for the recovery process; in that case, HBase repeatedly shuts itself down with an error indicating that it is out of heap. You can increase the heap temporarily:

  1. Attach to the running container:

    docker exec -it serviced-isvcs_opentsdb bash

  2. Modify the max heap size:
    echo "export HBASE_HEAPSIZE=2048" >> /opt/hbase*/conf/hbase-env.sh
  3. Start supervisorctl, then restart HBase at the supervisorctl prompt:
    supervisorctl -c /opt/zenoss/etc/supervisor.conf
    restart hbase
  4. Exit supervisorctl:
    exit
  5. Exit the container:
    exit
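
To confirm that the heap setting was appended without re-attaching to the container (the glob accounts for the HBase version directory):

    docker exec serviced-isvcs_opentsdb bash -c 'grep HBASE_HEAPSIZE /opt/hbase*/conf/hbase-env.sh'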

These settings will be reverted when serviced is restarted. If HBase repairs its corruption successfully, it will start normally. If the repair fails, the HMaster logs will indicate this and you might need to proceed with additional HBase recovery, for example:

  1. Attach to the running container:
    docker exec -it serviced-isvcs_opentsdb bash
  2. Run the HBase repair tool:
    JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64" HBASE_HOME=/opt/hbase-0.94.16 /opt/hbase-0.94.16/bin/hbase hbck -fix

If HBase is corrupted beyond repair, you might need to remove the existing data to allow it to start. Perform the following:

  1. On the master host, stop all processes in the internal metrics container:
    docker exec -it serviced-isvcs_opentsdb supervisorctl -c /opt/zenoss/etc/supervisor.conf stop all
  2. Remove the HBase data:
    rm -rf /opt/serviced/var/isvcs/opentsdb/hbase/.*
  3. Start the metrics processes:
    docker exec -it serviced-isvcs_opentsdb supervisorctl -c /opt/zenoss/etc/supervisor.conf start all

Corruption of Zenoss RabbitMQ (/opt/serviced/var/volumes/*/rabbitmq) Files

If RabbitMQ data has become corrupted, RabbitMQ might be unable to start, or it might start but some processes will be unable to connect to it. In either case, remove the existing data. Any messages that were in the queues when the hardware failure occurred are lost.

  1. Stop the RabbitMQ service:

    serviced service stop rabbitmq

  2. Delete the RabbitMQ data:
    export SERVICE_ID=$(serviced service status Zenoss.resmgr | sed -n '2p' | awk '{print $2}')
    export SVCROOT=/opt/serviced/var/volumes/$SERVICE_ID
    rm -rf $SVCROOT/rabbitmq
    rm -f $SVCROOT/.rabbitmq.serviced.initialized
  3. Restart RabbitMQ:

    serviced service start rabbitmq
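
To verify that RabbitMQ restarted cleanly, you can attach to its container and query the broker; this assumes rabbitmqctl is on the PATH inside the container:

    serviced service attach rabbitmq
    rabbitmqctl status
    exit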

Corruption of Zenoss HBase (/opt/serviced/var/volumes/*/hbase-master) Files

Zenoss HBase can be recovered in the same way as the internal HBase. Perform the following:

  1. Attach to the running container:

    serviced service attach hmaster

  2. Run the HBase repair tool:

    su - hbase -c "hbase hbck -fix"

If HBase is corrupted beyond repair, you may need to remove the existing data to allow it to start.

Warning: Performing the following steps removes all performance data.

To remove existing data to enable HBase to start, perform the following:

  1. On the master host, stop all HBase and OpenTSDB processes:

    serviced service stop hbase
    serviced service stop opentsdb

  2. Remove the HBase data:
    export SERVICE_ID=$(serviced service status Zenoss.resmgr | sed -n '2p' | awk '{print $2}')
    export SVCROOT=/opt/serviced/var/volumes/$SERVICE_ID
    sudo rm -rf $SVCROOT/hbase-*
    sudo rm -rf $SVCROOT/.hbase-*.serviced.initialized
  3. Start HBase and OpenTSDB:

    serviced service start hbase
    serviced service start opentsdb

Corruption of Zenoss Zookeeper (/opt/serviced/var/volumes/*/hbase-zookeeper-*) Files

It is possible, though not particularly likely, for the Zookeeper instance(s) used for HBase to become corrupted on hardware failure. Recovery consists of removing the corrupted data:

  1. Stop HBase:

    serviced service stop hbase

  2. Delete the Zookeeper data:
    export SERVICE_ID=$(serviced service status Zenoss.resmgr | sed -n '2p' | awk '{print $2}')
    export SVCROOT=/opt/serviced/var/volumes/$SERVICE_ID
    sudo rm -rf $SVCROOT/hbase-zookeeper-*
    sudo rm -rf $SVCROOT/.hbase-zookeeper-*.serviced.initialized
  3. Restart HBase:

    serviced service start hbase

Corruption of Event Indexes (/opt/serviced/var/volumes/*/zeneventserver/index)

If the Lucene-based event index becomes corrupted, zeneventserver will automatically rebuild it once the corrupted index data is removed. Perform the following:

  1. Stop zeneventserver:

    serviced service stop zeneventserver

  2. Delete the index data:
    export SERVICE_ID=$(serviced service status Zenoss.resmgr | sed -n '2p' | awk '{print $2}')
    export SVCROOT=/opt/serviced/var/volumes/$SERVICE_ID
    rm -rf $SVCROOT/zeneventserver/index
  3. Start zeneventserver:

    serviced service start zeneventserver

    The zeneventserver log will indicate the indexes are being rebuilt.
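
To watch the rebuild, you can tail the service log from inside the container; the log path below assumes the standard Zenoss layout:

    serviced service attach zeneventserver
    tail -f /opt/zenoss/log/zeneventserver.log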

Corruption of Catalog Service Indexes (/opt/serviced/var/volumes/*/zencatalogservice)

If the Lucene-based model index becomes corrupted, you must remove the data and rebuild the catalog. Perform the following:

  1. Stop zencatalogservice:

    serviced service stop zencatalogservice

  2. Remove the catalog data:
    export SERVICE_ID=$(serviced service status Zenoss.resmgr | sed -n '2p' | awk '{print $2}')
    export SVCROOT=/opt/serviced/var/volumes/$SERVICE_ID
    rm -rf $SVCROOT/zencatalogservice
    rm -rf $SVCROOT/.zencatalogservice.serviced.initialized
  3. Start zencatalogservice:

    serviced service start zencatalogservice

  4. With zencatalogservice started, rebuild the catalog. Because this can take a long time to complete, run it in a screen session (a sketch follows this step):
    serviced service attach zope/0 su - zenoss -c "zencatalog run --createcatalog --forceindex"
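
A minimal screen-based invocation (this assumes screen is installed on the master host; the session name is arbitrary):

    screen -S zencatalog
    serviced service attach zope/0 su - zenoss -c "zencatalog run --createcatalog --forceindex"

Detach with Ctrl-A d, and reattach later with screen -r zencatalog to check progress.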
    

Corruption of Redis (/opt/serviced/var/volumes/*/redis) Files

If hardware failure occurs in the middle of a snapshot, Redis may write an incomplete or zero-length database dump to disk. The Redis server will then be unable to start, its logs will show that it cannot load the database, and serviced will continually attempt to restart it. Recovery requires deleting the bad snapshot file. Note: any in-flight data in Redis at the time of the hardware failure will be lost.

  1. Attach to the redis container:

    serviced service attach redis

  2. Check the logs to verify the issue:

    tail -f /var/log/redis/redis.log

  3. If this is the problem, delete the bad snapshot:

    rm -f /var/lib/redis/dump.rdb

    Redis should recover when it is restarted by serviced (usually within 10 seconds).
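
To verify that Redis is answering again, you can attach and ping it; redis-cli ships with Redis:

    serviced service attach redis
    redis-cli ping

A reply of PONG indicates Redis has recovered.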

 


Comments

  • Mark Passell

    This needs to be updated to reflect our moving away from btrfs, and also because we moved Java to /usr/bin/java in the isvcs HBase container. The hbck -fix command doesn't work without the correct path.

  • Jagadish Nagasamudram

    Is this article applicable to 6.x?

  • Eric Thirolle

    Jagadish, it is almost entirely applicable to RM 6.x. But there are a couple of areas where the instructions are slightly different between RM 5.x and 6.x:
    (1) see the above note from Mark Passell re btrfs & change to HBASE_HOME
    (2) zencatalogservice has been replaced by Solr service, which requires some small changes to the commands for repairing the catalog
    I will work on updating the article to reflect those changes.
