
How To Confirm Zenoss Is Healthy After A Hard System Down

Applies To

  • Zenoss 4.x

Summary

After power has been restored to your Zenoss systems and the host server has booted, the steps required to ensure that Zenoss recovers to a healthy state depend on whether you have Zenoss and its related services configured to start automatically at boot or not. The following sections cover each scenario.

Procedure

Scenario A: Zenoss and related services are set to start automatically on boot.

After the server and Zenoss daemons have started, verify the following:

  1. Zends (mysql) is running:
    # service zends status 
  2. Rabbitmq-server is running:
    # service rabbitmq-server status
  3. memcached is running:
    # service memcached status
  4. Zenoss daemons are running:
    # service zenoss status
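The four checks above can also be run in a single pass. The following is a minimal sketch rather than a supported Zenoss tool: the function name check_services and the SERVICE_CMD override are hypothetical, and the sketch assumes each status call returns a nonzero exit status when the service is down. That is the usual init-script convention, but the exit status of service zenoss status might not reflect individual daemons, so still read its output.

```shell
#!/bin/sh
# Sketch: report the status of each service Zenoss depends on, in order.
# SERVICE_CMD is a hypothetical override for dry runs and testing;
# by default the real `service` command is used.
SERVICE_CMD="${SERVICE_CMD:-service}"

check_services() {
    failed=0
    for svc in zends rabbitmq-server memcached zenoss; do
        # Assumption: a nonzero exit status means "not running".
        if "$SERVICE_CMD" "$svc" status >/dev/null 2>&1; then
            echo "OK      $svc"
        else
            echo "FAILED  $svc"
            failed=1
        fi
    done
    return "$failed"
}
```

Run check_services as root after sourcing the file. A nonzero return value means at least one line reads FAILED.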

If Zends, Rabbitmq, memcached, or any combination of them is not running, refer to Section A-1 in the Troubleshooting section below.

If one or more of the Zenoss daemons are not running, but Zends, Rabbitmq, and memcached are running, refer to Section A-2 in the Troubleshooting section below.

If Zends, Rabbitmq, memcached, and all Zenoss daemons appear to be running correctly, proceed with the following final system checks.

Access the Zenoss user interface and look for Zenoss events that may indicate a problem. Check the event console for events related to Zenoss itself (for example, stopped daemons or heartbeat failures from daemons).

After enough time has passed that events from monitored devices would be expected to arrive in the Zenoss event console, confirm the presence of new device events in the console (“new” events being events that have arrived subsequent to the power interruption). If none are present, and if such events would be expected, test the event subsystem by generating a condition that would trigger an event, for example by running the logger command on a monitored Linux server that sends its events to Zenoss. Confirm that the events arrive in the Zenoss console.
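A test event is easier to find in the event console when it carries a unique marker. The sketch below wraps the standard logger command. The tag zenoss-recovery-test, the function names, and the user.err priority are hypothetical choices, and the sketch assumes the monitored server forwards messages at that facility and priority to Zenoss.

```shell
#!/bin/sh
# Sketch: build a uniquely tagged test message and send it through syslog.
# TEST_TAG is a hypothetical marker chosen so the event is easy to search for.
TEST_TAG="zenoss-recovery-test"

build_test_message() {
    # $1 = free-form note; the timestamp makes each test event distinct.
    echo "$TEST_TAG $(date '+%Y-%m-%dT%H:%M:%S') ${1:-post-outage check}"
}

send_test_event() {
    # -t sets the syslog tag, -p sets facility.priority.
    # Assumption: user.err messages are forwarded to Zenoss by this host.
    logger -t "$TEST_TAG" -p user.err "$(build_test_message "$1")"
}
```

After running send_test_event on the monitored server, search the event console for the tag to confirm that the event arrived.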

Confirm receipt of any email notifications that should be triggered by the incoming events.

Confirm that Zenoss is able to poll devices for monitored data points by verifying that performance graphs continue to update. Wait 10–15 minutes after the system finishes booting (or 2–3 polling cycles if the default 5-minute interval has been modified), then check one of the performance graphs for a device (for example, the CPU utilization graph of a server) and verify that the graph is continuing to update.
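The 10 to 15 minute wait is simply two to three polling cycles at the default 300-second interval. If your interval differs, the same arithmetic can be sketched as follows. The function name graph_wait_window is hypothetical, and results are truncated to whole minutes.

```shell
#!/bin/sh
# Sketch: print how long to wait (2 to 3 polling cycles) before checking graphs.
# $1 = polling interval in seconds; 300 is the default 5-minute interval.
graph_wait_window() {
    interval="${1:-300}"
    min=$(( interval * 2 / 60 ))   # two cycles, in whole minutes
    max=$(( interval * 3 / 60 ))   # three cycles, in whole minutes
    echo "Wait ${min}-${max} minutes"
}
```

For example, graph_wait_window 120 reports a 4-6 minute window for a 2-minute polling interval.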

If multiple collectors are being used, this test should be repeated for one device per collector.

To locate devices monitored by particular collectors, navigate to Advanced > Collectors to display the list of collectors.

For each collector listed, click on the collector name, which will bring up the Overview page for the collector. In the lower portion of the Overview page you will see a list of devices monitored by that collector. Click on a device in the list and verify that its performance graphs are being updated.

Navigate through the various pages of the user interface and verify that each page renders correctly and pages display supplementary information such as event counts when appropriate.

If any of the above checks reveals a problem with Zenoss, refer to Section A-3 below in the Troubleshooting section.

Optional: if no problems are discovered during these checks, you might want to complete the steps in Section A-3 below, as a precautionary measure.

Scenario B: Zenoss and related services are NOT set to start automatically on boot.

  1. Verify that no stray Zenoss processes are left over from the hard shutdown by searching for processes owned by the zenoss user:
    ps aux | grep zenoss
  2. If any Zenoss processes remain, use the kill command to stop each one by its PID:
    kill [PID of stuck process]
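The process search in step 1 can be narrowed to an exact match on the owning user, which avoids matching the grep command itself. This is a minimal sketch: the function name filter_by_user is hypothetical, and it assumes the Zenoss account is named zenoss, as in the su command used elsewhere in this article.

```shell
#!/bin/sh
# Sketch: list the PID and command line of every process owned by a user.
# Reads `ps -eo user,pid,args` style output on stdin, so it can be run
# against captured output as well as a live system.
filter_by_user() {
    # $1 = user name to match exactly against the first column.
    # Runs of spaces inside the command line are collapsed to one space.
    awk -v u="$1" '$1 == u { pid = $2; $1 = ""; $2 = ""; sub(/^ +/, ""); print pid, $0 }'
}
```

Usage: ps -eo user,pid,args | filter_by_user zenoss. Empty output means no processes owned by that user remain.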

After you confirm there are no stray PIDs, begin starting the services Zenoss depends on. After the services successfully start, launch Zenoss. Pause between each service startup to check and verify the service is running. Start the services in this order:

  • Zends
  • Rabbitmq-server
  • memcached
  • Zenoss
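The start order above, with a pause and a status check after each service, can be sketched as a loop. The names start_in_order, SERVICE_CMD, and PAUSE_SECONDS are hypothetical, and the 5-second pause is an arbitrary choice. The sketch does not include the Zends table check described in the Zends section; run that by hand when it is warranted.

```shell
#!/bin/sh
# Sketch: start each service in dependency order, pause, then re-check its
# status before moving on. Stops at the first failure.
# SERVICE_CMD and PAUSE_SECONDS are hypothetical knobs, not Zenoss settings.
SERVICE_CMD="${SERVICE_CMD:-service}"
PAUSE_SECONDS="${PAUSE_SECONDS:-5}"

start_in_order() {
    for svc in zends rabbitmq-server memcached zenoss; do
        echo "Starting $svc"
        "$SERVICE_CMD" "$svc" start || { echo "$svc failed to start"; return 1; }
        sleep "$PAUSE_SECONDS"
        # Assumption: a nonzero status exit code means the service is down.
        "$SERVICE_CMD" "$svc" status >/dev/null 2>&1 || {
            echo "$svc stopped shortly after starting"; return 1; }
    done
    echo "All services started"
}
```

Stopping at the first failure keeps later services from starting on top of a broken dependency, which matches the one-at-a-time approach described above.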

Zends

  1. Start Zends:
    service zends start
  2. After Zends starts, pause briefly and then run the command service zends status to verify that it continued to run after starting.

If Zends fails to start or stops running shortly after starting, look for any obvious causes by examining any Zends log files that might have been generated. The file names, when present, end with .err and are located in the /opt/zends/data/ directory of the host server. If error logs are present but provide no obvious cause for a failure to start, contact Zenoss Support.
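To review the error logs quickly, you can print the last few lines of each .err file. The following is a sketch only: the function name show_zends_errors and the 20-line default are hypothetical, and the directory default is the path given above.

```shell
#!/bin/sh
# Sketch: show the tail of every .err file in the Zends data directory.
# $1 = directory to search (defaults to the path named in this article)
# $2 = number of trailing lines to print per file (hypothetical default: 20)
show_zends_errors() {
    dir="${1:-/opt/zends/data}"
    lines="${2:-20}"
    found=0
    for f in "$dir"/*.err; do
        [ -f "$f" ] || continue      # the glob matched nothing
        found=1
        echo "== $f"
        tail -n "$lines" "$f"
    done
    [ "$found" -eq 1 ] || echo "No .err files in $dir"
}
```

The most recent entries are usually the relevant ones after a failed start, which is why the sketch prints the end of each file rather than the beginning.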

If Zends starts successfully (including after any troubleshooting needed to get it started), you can check the health of the Zends tables to verify that none were corrupted during the power outage. To check the tables:

  1. Log on to the command line of the host server and change to the zenoss user:
    su - zenoss
  2. Connect to the Zends client:
    zends -u root 

    Note: if your Zends has a root password, the command is:

    zends -u root -p
  3. To get a list of databases, run the following:
    show databases;
  4. To examine each database in turn for tables with problems, run the following commands at the zends command prompt:
    1. use [database name]; for example:

      use zenoss_zep;
    2. show table status;
      Examine the output for any tables with the word crashed appearing in the comments column. A more expedient search can be completed on each database by narrowing the show table status command, for example:
      show table status where Comment like '%crashed%';

    If you identify any crashed tables, contact Zenoss Support.
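    When there are several databases, typing the use and show table status pair for each one is repetitive. The sketch below generates the statements from a list of database names so that you can paste them at the zends prompt. The function name crashed_table_queries is hypothetical. Whether your zends client also accepts statements piped to its standard input is not confirmed by this article, so treat that as an assumption to verify.

```shell
#!/bin/sh
# Sketch: turn a list of database names (one per line on stdin) into the
# statements to run at the zends prompt. Blank lines and the "Database"
# header row from `show databases;` are skipped.
crashed_table_queries() {
    while IFS= read -r db; do
        case "$db" in ""|Database) continue ;; esac
        printf 'use %s;\n' "$db"
        printf "show table status where Comment like '%%crashed%%';\n"
    done
}
```

    For example, printf 'zenoss_zep\n' | crashed_table_queries prints the two statements for the zenoss_zep database.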

    If Zends starts successfully and its tables are healthy, continue starting the additional services.

rabbitmq-server

Start rabbitmq-server:

service rabbitmq-server start

Verify the service continues to run after starting. A moment or two after the service starts, issue the following command:

service rabbitmq-server status

memcached

Start memcached:

service memcached start

Verify the service continues to run after starting. A moment or two after the service starts, issue the following command:

service memcached status

Zenoss

  1. Start Zenoss:

    service zenoss start
  2. Verify all expected daemons have started correctly:
    service zenoss status

If individual Zenoss daemons failed to start, proceed to Section A-2 of the Troubleshooting section.

Troubleshooting

Section A-1

Stop Zenoss and its related daemons using the following commands, in order:

  1. service zenoss stop
  2. service memcached stop
  3. service rabbitmq-server stop
  4. service zends stop

After the stop commands report success, verify that every Zenoss daemon has actually stopped and that none are hung. Search for processes owned by the zenoss user:

ps aux | grep zenoss

If any Zenoss processes remain, wait a few moments and repeat the command to verify that they stop. If they fail to stop after a reasonable amount of time, use the kill command to stop the process using its PID:

kill [PID of stuck process]
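The wait-and-recheck step can be sketched as a polling loop with a timeout. Everything here is a hypothetical illustration: the function names, the 60-second timeout, and the 5-second interval are arbitrary choices, not values recommended by Zenoss. The example check uses pgrep -u, which exits with status 0 only when at least one process owned by the named user exists.

```shell
#!/bin/sh
# Sketch: poll until a check command reports no remaining processes, or give
# up after a timeout. All parameters are hypothetical choices:
# $1 = command that succeeds (exit 0) while processes are still running
# $2 = maximum seconds to wait, $3 = seconds between checks
wait_for_exit() {
    check="$1"; timeout="${2:-60}"; interval="${3:-5}"
    waited=0
    while "$check"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Still running after ${waited}s; consider kill"
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    echo "All processes stopped after ${waited}s"
}

# Example check: succeeds while any process owned by the zenoss user exists.
zenoss_procs_running() {
    pgrep -u zenoss >/dev/null 2>&1
}
```

Usage: wait_for_exit zenoss_procs_running 60 5. A nonzero return value means processes are still present and the kill step applies.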

When all services and daemons are stopped, begin starting the services in the reverse order they were shut down. Pause between each service to verify the service is running:

service zends start

Run the service zends status command a moment or two after zends finishes starting to ensure that it continued to run.

If Zends fails to start or stops running shortly after starting, look for any obvious causes by examining any Zends log files that might have been generated. The file names, when present, end with .err and are located in the /opt/zends/data/ directory of the host server. If error logs are present but provide no obvious cause for a failure to start, contact Zenoss Support.

If Zends starts successfully (including after any troubleshooting needed to get it started), you can check the health of the Zends tables to verify that none were corrupted during the power outage. To check the tables:

  1. Log on to the command line of the host server and change to the zenoss user:
    su - zenoss
  2. Connect to the Zends client:
    zends -u root 

    Note: if your Zends has a root password, the command is:

    zends -u root -p
  3. To get a list of databases, run the following:
    show databases;
  4. To examine each database in turn for tables with problems, run the following commands at the zends command prompt:
    1. use [database name];

      For example:

      use zenoss_zep;
    2. show table status;
      Examine the output for any tables with the word crashed appearing in the comments column. A more expedient search can be completed on each database by narrowing the show table status command, for example:
      show table status where Comment like '%crashed%';

If you identify any crashed tables, contact Zenoss Support.

If Zends starts successfully and its tables are healthy, continue starting the additional services, in the following order:

  1. rabbitmq-server

    Start rabbitmq-server:

    service rabbitmq-server start

    Verify the service continues to run after starting. A moment or two after the service starts, issue the following command:
    service rabbitmq-server status

  2. memcached

    Start memcached:

    service memcached start

    Verify the service continues to run after starting. A moment or two after the service starts, issue the following command:

    service memcached status
  3. Zenoss
    1. Start Zenoss:
      service zenoss start
    2. Verify all expected daemons have started correctly:
      service zenoss status

    If individual Zenoss daemons failed to start, proceed to Section A-2 of the Troubleshooting section.

Section A-2

To troubleshoot one or more individual Zenoss daemons that failed to start, switch to the Zenoss user:

su - zenoss

Attempt to start the daemon individually:

[daemon name] start

For example:

zenperfsnmp start

If the daemon fails to start, check its log file for clues to the cause of the failure. The Zenoss log files are located in the /opt/zenoss/log directory of the host server. If the log files do not offer helpful clues, you can attempt to start the daemon in the foreground with a higher logging level to see if the error causing the failure can be identified in the logging that follows. For example, if you are troubleshooting zenperfsnmp, run the command:

zenperfsnmp run -v 10 --cycle

Repeat the process as required for any daemons that are unable to start. If no obvious cause for the failures can be identified or you need help interpreting the logged errors, contact Zenoss Support.
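When a log file is long, filtering for lines that look like failures can speed up the review. This is a sketch under stated assumptions: the function name scan_daemon_log is hypothetical, and the pattern assumes the logs contain level names such as ERROR or CRITICAL and Python traceback headers. Adjust the pattern to what your log files actually contain.

```shell
#!/bin/sh
# Sketch: print numbered lines from a daemon log that look like failures.
# The pattern is an assumption based on common Python logging output
# (level names and traceback headers), not a documented Zenoss format.
scan_daemon_log() {
    # $1 = path to the log file to scan
    grep -nE 'ERROR|CRITICAL|Traceback' "$1"
}
```

Usage: scan_daemon_log followed by the path to the daemon's log file in /opt/zenoss/log. The line numbers in the output make it easier to open the file at the right place and read the surrounding context.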

Section A-3

Stop Zenoss:

service zenoss stop

After the stop command reports success, verify that every Zenoss daemon has actually stopped and that none are hung. Search for processes owned by the zenoss user:

ps aux | grep zenoss

If any Zenoss processes remain running, wait a few moments and repeat the command to verify that they stop. If they fail to stop after a reasonable amount of time, use the kill command to stop the process using its PID:

kill [PID of stuck process]

When you have confirmed that all Zenoss daemons have stopped, start Zenoss:

service zenoss start

If some individual Zenoss daemons remain unstarted, refer to Section A-2 of this Troubleshooting section.
