Zenoss Master - Staged Startup and Shutdown Best Practices for Maintenance

Applies To

  • Zenoss 5.x
  • Zenoss 6.x

Summary

When it becomes necessary to perform maintenance on Zenoss Resource Manager (RM), it is useful to perform controlled, incremental shutdowns and startups of Control Center (CC) and RM. This KB describes best practices for performing these tasks.

Procedures

Controlled Shutdown Procedure

  1. Stop RM and top-level applications.

    The RM application should be stopped before stopping Control Center. This will allow for a controlled startup of RM services after Control Center is restarted.

    Use the following command to stop RM:

    # serviced service stop Zenoss.resmgr
    

    Wait for all of the applications to stop. To determine if the applications are stopped, watch the status of services until all have a state of Stopped. The following command lists every service that is NOT stopped. It is safe to ignore services that have no status - these are the “folder” services that do not run anything:

    # serviced service status | grep -v Stopped
  2. Stop the CC master:
    # systemctl stop serviced
  3. Stop all resource pool workers (formerly known as CC agents).

    On each resource pool worker node, stop serviced:

    # systemctl stop serviced
  4. Remove stray docker containers.

    NOTE: Perform this step on both the master and the resource pool worker nodes.

    Although stopping serviced is normally enough to stop all docker containers on the node, sometimes stray containers remain. Because stray containers can prevent NFS from unmounting in the next step, ensure all containers are stopped and no strays remain.

      1. Determine if there are remaining containers, including those in an Exited status:
        # docker ps -a 
      2. If any containers remain, use the following command to remove them:
        # docker ps -qa  | xargs --no-run-if-empty docker rm -fv
      3. If the rm command hangs

        In some edge cases, the previous command hangs because a container will not die, most frequently due to an NFS hang. To resolve the issue and stop the container, perform the following steps:

        1. Stop NFS:
          # systemctl stop nfs
        2. Stop docker:
          # systemctl stop docker
        3. Start NFS:
          # systemctl start nfs
        4. Start docker:
          # systemctl start docker
        5. Kill remaining container(s):
          # docker ps -qa  | xargs --no-run-if-empty docker rm -fv

          NOTE: If a container still persists, the last resort is to reboot the resource pool worker node.

  5. Unmount all resource pool NFS mount points.

    NOTE: This step applies to the resource pool worker nodes.

    When serviced and the docker containers are stopped, unmount the NFS mount points. This prevents the possibility of any “stale NFS mount” errors on the resource pool worker nodes in cases where the master’s storage has to be completely replaced.

    1. Check for active mounts:
      # grep serviced /proc/mounts

      A typical agent-side mount has the following format:

      <ccMasterIP>:/serviced_volumes_v2/<tenantID> /opt/serviced/var/volumes/<tenantID>
      

      For example:

       1.8.1.4:/serviced_volumes_v2/erjcdpennn9yqfcg82hfso6q7 /opt/serviced/var/volumes/erjcdpennn9yqfcg82hfso6q7
      

      If there are no serviced mounts, unmounting is complete. If serviced mounts exist, they must be removed; proceed to step 2 and use the force option.

      Note: If the master has multiple tenant applications, there can be multiple mounts.

    2. Use the force option to unmount the volumes:
      # umount -f /opt/serviced/var/volumes/<tenantID>
    3. Confirm the mount point is removed by consulting /proc/mounts. Note that in some edge cases a simple umount does not work; the workaround is to perform a ‘lazy’ unmount and restart NFS:
        # umount -f -l /opt/serviced/var/volumes/<tenantID>
        # systemctl restart nfs
      
    4. If the umount still fails, reboot the resource pool worker node.
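
Before proceeding with maintenance, it can be useful to confirm that each node is fully quiesced. The following is a minimal verification sketch that simply repeats the checks described above; on a cleanly stopped node, neither command should return anything:

  # docker ps -qa                # expect no container IDs
  # grep serviced /proc/mounts   # expect no serviced NFS mounts (resource pool worker nodes)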

Controlled Startup Procedure

This procedure helps avoid a chaotic startup in a large production environment and enables rapid isolation and resolution of any problems that appear during startup or restart. There are many acceptable variations in which RM services are started and in what order; the key point of this procedure is to defer starting the collection services until all of the other RM services required to support the collectors are up and running.

Start CC Master

  1. Determine whether the CC deployment is using a Zookeeper ensemble.
    To determine if an ensemble is configured, inspect the /etc/default/serviced file and look for the following definition (a grep one-liner is shown after this list):
    SERVICED_ISVCS_ZOOKEEPER_QUORUM
    • If SERVICED_ISVCS_ZOOKEEPER_QUORUM is not set, the deployment is not using a Zookeeper ensemble and the CC master can be started.
    • If SERVICED_ISVCS_ZOOKEEPER_QUORUM is set, the deployment is using a Zookeeper ensemble and the CC master cannot be started without the other nodes that make up the ensemble. Before starting the CC master, start Control Center on the resource pool worker nodes defined by that quorum string.
  2. Start the CC master:
    # systemctl start serviced
    
  3. Monitor the CC log file to verify successful startup:
    # journalctl -u serviced -o cat -f
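
As a quick way to perform the check in step 1 above, grep the configuration file for the quorum variable. This is only a convenience sketch; the authoritative check is to inspect /etc/default/serviced directly:

  # grep SERVICED_ISVCS_ZOOKEEPER_QUORUM /etc/default/serviced

If the command returns nothing, or only a line that begins with #, the variable is not set and the deployment is not using a Zookeeper ensemble.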

Restart Worker Nodes and RM

At each of the following steps, verify that the RM services started in that step are running and show clean health checks before proceeding to the next step. If the services do not start, or do not pass their health checks, stop and resolve the underlying issue before continuing.
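
Service state at each step can be watched from the CC master with the same status command used in the shutdown procedure; health checks can also be reviewed in the CC UI:

  # serviced service status | grep -v Stopped    # everything that is not in the Stopped state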

  1. Restart Resource Pool Worker Nodes
    1. Log in to each resource pool worker node and restart serviced (see the example after this list).

      Note: This step is not required for any resource pool worker node already started in the previous procedure to enable the Zookeeper ensemble.

    2. Monitor the Hosts page in the CC UI to verify all hosts are up and running.
  2. Review IP Assignments

    In cases where the system has been restarted following a restore from backup, IP assignments may need to be defined. To review the assignments:

    1. Navigate to the Zenoss.resmgr page under Applications in the CC UI.
    2. Review the IP Assignments table.
      • If all of the services already have an IP assignment, continue to the next step.
      • If any service does not have an IP assignment, add one for that service. If Automatic assignment does not add the appropriate IP assignments, use a manual IP assignment.
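
For step 1 above, serviced on each resource pool worker node is started and monitored with the same commands used on the CC master:

  # systemctl start serviced
  # journalctl -u serviced -o cat -f

For step 2, some CC releases also provide a serviced service assign-ip subcommand as a command-line alternative to the UI; this is mentioned only as an option to investigate, and its availability and syntax should be confirmed with the command's built-in help for your version.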

Start Services & Metrics

  1. Start the services under HBase.
    HBase is the first set of services to start because OpenTSDB and the rest of the performance metric pipeline depend on the HBase services.
  2. Start the remaining services under Infrastructure.
  3. Start the services under Events.
  4. Start the services under User Interface.
  5. Start the zproxy service.
    ZProxy is the topmost service (named Zenoss.resmgr in most installations). Clicking the Start button for the Zenoss.resmgr service in the UI offers the choice of starting just that single service, or the service and all of its children. Choose to start only the single Zenoss.resmgr service.
  6. Start the services under Metrics.
  7. Start the service(s) named zenhub.
  8. Start the remaining collector services.
  9. Log in to RM and spot-check data collection and events to verify that everything is working properly.
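
The staged startup above is described in terms of the CC UI, but the same ordering can be driven from the command line on the CC master using the start counterpart of the stop command from the shutdown procedure. The following is a sketch only; <serviceName> is a placeholder for a service name exactly as it appears in serviced service status output, and depending on the CC version, starting a service from the CLI may also start its child services, so review the command's built-in help before relying on it:

  # serviced service start <serviceName>
  # serviced service status | grep -v Stopped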