Applies To
- Zenoss Control Center 1.1.x
Summary
This document outlines a procedure for destroying and recreating a Control Center thin pool.
NOTES:
- Before using this procedure on a production server, carefully review all alternatives.
- Read this document completely before starting this procedure to ensure that you have a basic understanding of the different steps.
Disclaimer:
- This document does NOT apply to cases where the docker thin pool is full.
- This document is specific to Control Center 1.1.x and above because Control Center versions 1.0.x do not use thin pools.
- While the high-level set of steps for this procedure are generally true for both HA and non-HA environments, many of the details of this procedure do NOT apply in an HA context.
- Although there are likely multiple ways to rebuild the thin pool for Control Center, this KB does not profess to be the only method. It provides a single method. Your Mileage May Vary.
Procedure
Before beginning this procedure, determine the available alternatives. For example, if the thin pool is 100% full and the kernel has switched the DFS into read-only mode:
- Have you tried deleting any application snapshots to free up additional space? (See the example after this list.)
- Have you tried adding more storage to the Control Center volume group?
- Is there a VM-level snapshot or image from before the thin pool filled that can be restored, which might roll back enough application data to free up space?
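For the first alternative, the serviced CLI can list and delete application snapshots, assuming serviced is still responsive. The snapshot ID below is a placeholder for an ID reported by the list command:
# serviced snapshot list
# serviced snapshot remove SNAPSHOT_ID_HERE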
Summary of Procedure Steps
- Review the current storage configuration for the Control Center master because the storage settings in the new thin pool must be replicated.
- Perform a controlled shutdown and removal of Resource Manager so a fresh Resource Manager can be restored from a backup.
- Perform a controlled shutdown of Control Center so a controlled startup can be performed at the end of this procedure.
- Unmount all tenant application volumes on the Control Center master.
- Deactivate all Control Center devicemapper devices on the Control Center master.
- Destroy the logical volume and associated volume group.
- Create a new thin pool.
- Start Control Center and restore the last backup.
- Perform a controlled startup of Resource Manager, correcting any problems that might arise.
NOTE: Many of the commands in this procedure must be run as the root user, or a user with sudo access. Instead of attempting to specifically identify individual commands, the examples assume root access.
Review Current Configuration
Four items must be identified on the Control Center master before proceeding:
- The devicemapper device name for the DFS.
- The devicemapper base size for the DFS.
- The Control Center volume group.
- Whether the Control Center volume group is shared with anything else.
Devicemapper Device Name
This is the value of the configuration setting SERVICED_DM_THINPOOLDEV.
Typically, this value is /dev/mapper/serviced-serviced--pool.
Note that in some cases it may be different, especially if the thin devices were created manually.
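One way to confirm the configured value, assuming the default Control Center configuration file location of /etc/default/serviced:
# grep SERVICED_DM_THINPOOLDEV /etc/default/serviced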
Devicemapper Base Size
By default, serviced-storage creates new thin pools with a base size of 100GB (or the value of the command line parameter -o dm.basesize=).
The vast majority of installations use 100GB as the base size. However, if the tenant device has been resized since the initial installation, use the new size when creating the new pool.
If you know the most recent size the device was resized to, that is the base size to use. You can verify the current sizes with:
df -h /opt/serviced/var/volumes
NOTE: The maximum size of each tenant volume under /opt/serviced/var/volumes is typically the base size.
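As a variant of the command above, assuming the default volume path and that the tenant volumes are still mounted, the following lists the size of each tenant volume separately:
# df -h /opt/serviced/var/volumes/*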
Control Center Volume Group
By default, the Control Center volume group is named serviced. It is used exclusively by Control Center.
WARNING
In theory, a single volume group could be created for Control Center and shared with something else. If the volume group is shared with something else, STOP and do not use this procedure. Proceeding with the rest of this procedure will result in the loss of the other data in that volume group.
To verify the existing volume group, use the commands lsblk, lvdisplay, vgdisplay and pvdisplay to look at the relationships between the logical volumes, the volume groups and the physical volumes.
If the logical volume for serviced shares a volume group with some other logical volume, proceeding with this procedure will destroy all data for that other logical volume.
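For example, the following lists every logical volume together with its volume group and size; if any logical volume other than serviced-pool appears in the serviced volume group, do not proceed:
# lvs -o lv_name,vg_name,lv_size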
Stopping Resource Manager and Control Center
If the pool is 100% full and the DFS has been changed to read-only, no effective monitoring is occurring, so there is no harm in stopping Resource Manager and Control Center at this stage. More importantly, restarting Resource Manager in the later stages of this process is much easier if you start from a clean slate. The best way to start from a clean slate is to perform a controlled shutdown before attempting any recovery, so that you can perform a controlled startup once things are put back together.
- Stop Resource Manager and top-level applications
The Resource Manager application must be stopped before it can be removed. Use the following command to stop Resource Manager:
# serviced service stop Zenoss.resmgr
Wait for all applications to stop. Watch the status of services until all have a state of Stopped. The following command lists every service that is NOT stopped. It is safe to ignore the services that have no status because these are the “folder” services that do not run anything:
# serviced service status | grep -v Stopped
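To monitor progress continuously, a simple approach (assuming the standard watch utility is installed) is to refresh the same command until only stopped services remain:
# watch -n 10 'serviced service status | grep -v Stopped'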
- Remove the Resource Manager Application
When the Resource Manager application is stopped, it can be removed from Control Center. The application will be re-created later when restoring the backup.
NOTE: If the Resource Manager application is not removed prior to shutting down serviced, a known issue in Control Center 1.1.5 causes Resource Manager to use the base devicemapper device, which ultimately leads to the deletion of the base device and the failure of serviced to start.
Use the Control Center command line to remove Resource Manager:
# serviced service remove Zenoss.resmgr
After Resource Manager is removed, perform a controlled shutdown of the Control Center master and all worker nodes. The procedure for stopping the Control Center master and the worker nodes is documented in the KB titled Zenoss Master - Staged Startup and Shutdown Best Practices for Maintenance.
When everything is stopped, it is a good practice to disable auto-start of Control Center. This is a preventative measure. If the host is rebooted later in this procedure, it might not be desirable for Control Center to automatically start up because it can activate/mount devicemapper (DM) devices in the thin pool before you are ready. Disable auto-start of Control Center:
# systemctl disable serviced
NOTE: When everything is recovered, re-enable the Control Center master to start at bootup:
# systemctl enable serviced
Unmount all volumes
NOTE: Applies to the Control Center Master only.
Before rebuilding the thin pool, it is necessary to deactivate all devicemapper devices. Note that a devicemapper device cannot be deactivated if it has a filesystem mounted on it. It is necessary to manually unmount any volumes mounted on any Control Center devicemapper devices. After they are unmounted, the corresponding devices can be deactivated.
Each top-level tenant application in Control Center typically has two mount points:
- one for use by the Control Center master
- one that is exported via NFS for use by all resource pool worker nodes.
For a given tenant application, both mount points are on the same devicemapper device so they both reference the same data.
NOTE: Although many installations have only a single top-level tenant application (for example, Zenoss.resmgr), any given deployment of Control Center can have an arbitrary number of tenant applications.
Use the following command to list all of the active Control Center devicemapper devices:
# lsblk -ln -o NAME /dev/mapper/serviced-serviced--pool | tail -n+2
For each of the device names reported by the lsblk command, unmount all filesystems on that device and deactivate the device:
# umount -A /dev/mapper/deviceNameHere
# dmsetup remove /dev/mapper/deviceNameHere
NOTE: The commands can be combined into a simple script:
# DEVICES=$(lsblk -ln -o NAME /dev/mapper/serviced-serviced--pool | tail -n+2)
# for deviceName in $DEVICES; do umount -A /dev/mapper/$deviceName; dmsetup remove /dev/mapper/$deviceName; done
Verify there are no mount points on any of the Control Center /dev/mapper/* devices. Use one of the following commands to verify:
df -h
or
lsblk
or
grep serviced /proc/mounts
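To confirm that the devicemapper devices themselves are deactivated, not merely unmounted, a check such as the following can help; once all tenant devices have been removed, it should report only the thin pool itself:
# dmsetup ls | grep serviced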
If you cannot unmount the devices, the workaround of last resort is to reboot the box.
NOTE: Before rebooting, verify that Control Center is not configured to start at reboot:
systemctl disable serviced
Remove the Thin Pool
If it was necessary to reboot to force volumes to be unmounted or devices to be deactivated, re-verify after rebooting that there are no remaining mount points on any Control Center /dev/mapper/* devices. Use one of the following commands to verify:
df -h
or
lsblk
or
grep serviced /proc/mounts
If mount points exist, it is likely that the Control Center master was automatically restarted on reboot. Repeat the steps above to unmount the volumes and deactivate the Control Center devices before proceeding.
NOTE: The following steps assume the default name for the Control Center volume group, serviced. This is the case if the original thin pool was set up using the serviced-storage tool.
- Remove the thin pool:
# lvremove -f serviced/serviced-pool
- Remove Control Center metadata about the thin pool
Control Center maintains metadata about each of its devicemapper devices in a set of files. The files are stored in the directory /opt/serviced/var/volumes/.devicemapper. These files must be removed to prevent Control Center from referencing stale devices when the new pool is created:
# rm -rf /opt/serviced/var/volumes/.devicemapper
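NOTE: If you want to retain the stale metadata for later analysis, archive the directory before removing it; the destination path below is only an example:
# cp -a /opt/serviced/var/volumes/.devicemapper /tmp/devicemapper-metadata.bak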
Recreate the Thin Pool
Create a new thin pool using the following command:
# lvcreate -T --name serviced-pool -l 90%FREE serviced
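To verify the result, a quick check with lvs should show the new serviced-pool logical volume, its size, and its (initially near-zero) data usage:
# lvs -o lv_name,lv_size,data_percent serviced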
Restore from Backup
Before restarting Control Center and restoring from backup, review the cron configuration of the Control Center master. In many cases, an automated backup might be configured. Because all Control Center backups and restores are serialized, no restore can run while a backup is running. A corollary to Murphy’s Law dictates that if an automated backup is scheduled to run, it will probably try to run in the middle of the restore/recovery operation.
Consult the cron schedule and temporarily disable any scheduled backup that might overlap with the restore procedure.
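For example, if the backup is scheduled in root's crontab, the standard tools can be used to review the schedule and comment out the backup entry until the restore completes:
# crontab -l
# crontab -e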
NOTE: The time required to restore a backup and bring everything back online can exceed an hour. Plan accordingly.
Start the Control Center master using the procedure named “Start CC Master” in the KB titled Zenoss Master - Staged Startup and Shutdown Best Practices for Maintenance.
After the Control Center master is started, navigate to the Backup/Restore tab in the Control Center UI and restore your preferred backup. You can monitor the progress of the restore from the Control Center UI and/or the Control Center log file.
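Assuming serviced runs as a systemd unit and logs to the journal, the progress of the restore can also be followed from the command line with:
# journalctl -u serviced -f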
Controlled Resource Manager Startup
When the restore successfully completes, restart resource pool worker nodes and perform a controlled startup of Resource Manager using the procedure named “Restart Worker Nodes and RM” in the KB titled Zenoss Master - Staged Startup and Shutdown Best Practices for Maintenance.
Re-enable Suspended Operations
Re-enable the auto-start of Control Center on system reboot:
# systemctl enable serviced
If you suspended scheduled serviced backups to avoid interference with the restore operation, then re-enable the backups.
If necessary, adjust any static IP assignments changed during the course of this restore process.