How To Configure Zenoss for High-Availability Using DRBD and Heartbeat (Resource Manager 4.1.x)

IMPORTANT NOTE: As of December 2013, Zenoss recommends that all new Resource Manager installs and any upgrades of Resource Manager that involve an upgrade of the host operating system to RHEL Centos 6.4 or later follow the procedure outlined in KB 16052-186 "How do I Configure Zenoss Resource Manager for use with Red Hat Cluster Suite?" instead of following the procedure documented here.


Prerequisites

  • root or sudo access to the servers
  • access to repositories to obtain the DRBD and Heartbeat components
  • additional prerequisites as described in the procedure below

Applies To

  • RPM based installs on CentOS or RHEL only
  • Zenoss 4.1.x


Zenoss can be set up in a highly available (active/passive) configuration fairly easily using common components of the Linux-HA project: Heartbeat and DRBD. DRBD (Distributed Replicated Block Device) mirrors block devices much like a RAID 1 configuration, except that the mirroring is done across the network. DRBD is used to give the two servers a constantly synchronized copy of the MySQL and ZODB data. The heartbeat service handles bringing the slave (passive) node up or down when the master fails.



Certain assumptions have been made to make this article as widely applicable as possible. The following conventions will be used throughout the article, and should be replaced with your local settings. These instructions are targeted primarily to CentOS 5, but with some minor modifications can be applied to most Linux distributions.

    primary node: hostnameA
    secondary node: hostnameB

IP Addresses:
    primary node:
    secondary node:
    shared cluster:

Physical Block Devices:
    /opt/zenoss: /dev/sda2
    /opt/zenoss/perf: /dev/sda3


You need to have the following in place before you can start configuring the system for high availability. All of these steps should be performed on both servers that are destined to be in the highly available cluster.

File System Layout

You will need three separate file systems for the best performing setup.

  • / - at least 8GB - stores operating system files and will not be replicated.
  • /opt/zenoss - at least 50GB - stores most Zenoss files and will be replicated.
  • /opt/zenoss/perf -  at least 10GB - stores Zenoss performance data and will be replicated.

After booting up for the first time, you should unmount the latter two file systems as we will be repurposing them for replication. You can do this with the following commands.

umount /opt/zenoss/perf
umount /opt/zenoss

You should also remove these two file systems from the /etc/fstab file so that they won't be automatically remounted upon boot:

LABEL=/opt/zenoss       /opt/zenoss             ext3    defaults        1 2
LABEL=/opt/zenoss/perf  /opt/zenoss/perf        ext3    defaults        1 2

Disk Replication with DRBD

You will need to install the drbd and kmod-drbd packages to enable disk replication for the cluster. The names of these packages can differ from one Linux distribution to another, but the following will work to install them on CentOS 5 (tested with CentOS 5.7).

Note: These packages are provided by the CentOS Extras repo; you will need that repo enabled to install drbd82.

yum install drbd82 kmod-drbd82

You should now configure DRBD to replicate the file systems.

NOTE: The format below is a shorthand notation available only in DRBD 8.2.1 and later. If you are running a version prior to 8.2.1, please see the DRBD user's guide (available on the DRBD website) for the proper format.

Replace the existing contents of the /etc/drbd.conf file with the following.

global {
    usage-count no;
}

common {
    protocol C;

    disk {
        on-io-error detach;
        no-disk-flushes;
        no-md-flushes;
    }

    net {
        max-buffers 2048;
        unplug-watermark 2048;
    }

    syncer {
        rate 700000K;
        al-extents 1801;
    }
}

resource zenhome {
    device /dev/drbd0;
    disk /dev/sda2;
    meta-disk internal;

    on hostnameA {
        address <primary node IP>:7788;
    }

    on hostnameB {
        address <secondary node IP>:7788;
    }
}

resource zenperf {
    device /dev/drbd1;
    disk /dev/sda3;
    meta-disk internal;

    on hostnameA {
        address <primary node IP>:7789;
    }

    on hostnameB {
        address <secondary node IP>:7789;
    }
}

Replace the <primary node IP> and <secondary node IP> placeholders with your node addresses. Ports 7788 and 7789 (one per resource) follow the common DRBD convention; whatever ports you choose must match your firewall rules.

Now that DRBD is configured we need to initialize the replicated file systems with the following commands.

NOTE: You must ensure that the ports you used in your DRBD configuration are open in the firewall (or that the firewall is disabled); otherwise the connection between the two servers will not be made.
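For reference, on a CentOS 5 system using /etc/sysconfig/iptables, the rules allowing the replication traffic might look like the fragment below. The 7788/7789 ports are an assumption; match them to the address lines in your own drbd.conf.

```
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 7788 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 7789 -j ACCEPT
```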

These commands should be run on both the Primary and Secondary Zenoss Server:

dd if=/dev/zero bs=1M count=1 of=/dev/sda2
dd if=/dev/zero bs=1M count=1 of=/dev/sda3
sync
drbdadm create-md zenhome
drbdadm create-md zenperf
service drbd start

These commands should be run on the Primary Zenoss server:

drbdadm -- -o primary zenhome
drbdadm -- -o primary zenperf
mkfs.ext3 /dev/drbd0
mkfs.ext3 /dev/drbd1
service drbd stop

DRBD will not let you initialize the filesystems on the secondary server while its disks are in the secondary role, so you need to stop DRBD on the primary and temporarily make the secondary server the primary. Run 'service drbd stop' on the primary server, then run the following commands on the secondary server to initialize the filesystems:

drbdadm -- -o primary zenhome
drbdadm -- -o primary zenperf
mkfs.ext3 /dev/drbd0
mkfs.ext3 /dev/drbd1

After initializing the filesystems on the secondary server you must set the secondary server back to being the secondary server:

drbdadm secondary zenhome 
drbdadm secondary zenperf

You must also set the primary server back to being the primary:

service drbd start
drbdadm -- -o primary zenhome
drbdadm -- -o primary zenperf

You can now mount the two replicated file systems on the primary server with the following commands:

mount /dev/drbd0 /opt/zenoss
mkdir /opt/zenoss/perf
mount /dev/drbd1 /opt/zenoss/perf


RabbitMQ Setup

RabbitMQ-Server must be configured before Zenoss will install properly. Here are the steps needed to set up RabbitMQ.

Step 1: Synchronize UID/GID for rabbitmq User

The rabbitmq-server RPM can install the rabbitmq user and group with a different uid/gid depending on which RPMs are already installed on the system. For this reason, you should check the /etc/passwd and /etc/group files on both nodes of the cluster to verify that the rabbitmq user and group have matching uids and gids respectively. If the uid/gid don't match, change them so that they do. After changing the uid and gid you must fix the ownership of any RabbitMQ files that already exist on the system. This can be done with the following commands:

chown -R rabbitmq:rabbitmq /var/lib/rabbitmq
chown -R rabbitmq:rabbitmq /var/log/rabbitmq
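To compare the uid and gid on the two nodes quickly, a sketch like the following can be run on each node and the output compared. PASSWD and GROUP default to the real files; they are overridable only so the sketch can be tried against copies.

```shell
# Print the uid/gid recorded for the rabbitmq user and group; run this on
# both nodes and verify the numbers match.
PASSWD="${PASSWD:-/etc/passwd}"
GROUP="${GROUP:-/etc/group}"
uid=$(awk -F: '$1 == "rabbitmq" { print $3 }' "$PASSWD")
gid=$(awk -F: '$1 == "rabbitmq" { print $3 }' "$GROUP")
echo "rabbitmq uid=${uid:-absent} gid=${gid:-absent}"
```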

Step 2: Make "service rabbitmq-server status" Quiet

Heartbeat normally determines whether services under its control are running by executing "service <servicename> status". If this status check has a zero exit code, the service is deemed to be running; if it has a non-zero exit code, the service is deemed to be in some kind of stopped or failed state. There is one important exception to this behavior: if the status command's text output contains "running", heartbeat will assume the service is running even if the exit code was non-zero.

What does "service rabbitmq-server status" print when RabbitMQ isn't running?

Status of all running nodes...

Error: no_nodes_running

This makes heartbeat think that the rabbitmq-server service is always running, which means it will never try to start it. To fix this, you must modify the rabbitmq-server startup script to be quiet when its status operation is called. Edit /etc/init.d/rabbitmq-server and find the following line:

status) status_rabbitmq ;;

Then add "quiet" after status_rabbitmq so that it looks like this instead:

status) status_rabbitmq quiet ;;
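The same edit can be applied with sed. This sketch assumes the status_rabbitmq call appears on a single "status)" line as shown above, and it demonstrates the change on a scratch copy; set INITSCRIPT to /etc/init.d/rabbitmq-server to apply it for real.

```shell
# Demonstrate on a scratch file; point INITSCRIPT at the real init script
# to apply the change (a .bak backup is kept either way).
INITSCRIPT="${INITSCRIPT:-$(mktemp)}"
[ -s "$INITSCRIPT" ] || echo 'status) status_rabbitmq ;;' > "$INITSCRIPT"
# Add "quiet" so the status operation produces no misleading output.
sed -i.bak 's/status_rabbitmq *;;/status_rabbitmq quiet ;;/' "$INITSCRIPT"
cat "$INITSCRIPT"
```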

Step 3: Add Cluster Alias to /etc/hosts

We're going to change the RabbitMQ nodename to make it work in a clustered environment, so we need to add an alias to the /etc/hosts file. Assuming the two nodes of your cluster are named something like zenoss1 and zenoss2, you should add a "zenoss" alias at the end of the existing /etc/hosts line for the node's own hostname, on each node, so that "zenoss" resolves to the local host on both servers.
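As an illustration with hypothetical RFC 5737 addresses, the resulting /etc/hosts lines on zenoss1 might look like this (on zenoss2 the "zenoss" alias goes on that node's own line instead):

```
192.0.2.11    zenoss1    zenoss
192.0.2.12    zenoss2
```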

Step 4: Create Shared Storage for Persistent Queues

We use persistent queues in RabbitMQ to make sure that we don't lose events even if a server reboots or RabbitMQ is restarted. These queues are persisted to disk, so we need to make sure that when we fail over to the other node of the cluster, it starts reading from the existing persisted queue. To do this, we locate the RabbitMQ data files on a shared filesystem: /opt/zenoss. Run the following commands to do this. This only needs to be done on the primary because it will be replicated to the secondary.

mkdir -p /opt/zenoss/rabbitmq/mnesia
chown rabbitmq:rabbitmq /opt/zenoss/rabbitmq/mnesia

Step 5: Configure RabbitMQ

Create a new file called /etc/sysconfig/rabbitmq and add the following lines to it. See "Step 3" for what the RABBITMQ_NODENAME should be set to. It should be rabbit@ followed by whatever you chose for the alias.

export RABBITMQ_NODENAME="rabbit@zenoss"
export RABBITMQ_MNESIA_BASE="/opt/zenoss/rabbitmq/mnesia"

You must now restart the rabbitmq-server service:

service rabbitmq-server restart 

Then set up the Zenoss user, vhost, and permissions again. Be sure to use your nodename for the -n parameter to rabbitmqctl.

rabbitmqctl -n rabbit@zenoss add_user zenoss zenoss 
rabbitmqctl -n rabbit@zenoss add_vhost /zenoss
rabbitmqctl -n rabbit@zenoss set_permissions -p /zenoss zenoss '.*' '.*' '.*' 

Finally, you should verify that the rabbitmq-server service is listed in /etc/ha.d/haresources (in the case of heartbeat) and that it appears before zenoss on the line. You should also verify that the rabbitmq-server service is not set to start automatically on boot; the heartbeat daemon will handle starting it when appropriate.

chkconfig rabbitmq-server off
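A quick, scripted sanity check of that ordering might look like the following sketch; check_order is a hypothetical helper, and the haresources path is whatever you created for your cluster.

```shell
# check_order FILE: succeed when rabbitmq-server is listed before zenoss
# in a heartbeat haresources file (backslash continuations are handled).
check_order() {
    awk '{ gsub(/\\/, ""); for (i = 1; i <= NF; i++) print $i }' "$1" |
    awk '/^rabbitmq-server$/ { r = NR } /^zenoss$/ { z = NR }
         END { exit !(r && z && r < z) }'
}
# Example: check_order /etc/ha.d/haresources && echo "ordering OK"
```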

Zenoss Setup

Everything is now in place for the Zenoss installation. You should perform a normal Zenoss installation according to the regular instructions. You will need to install the Zenoss RPM on both servers; ZenPacks only need to be installed on the primary server, because the data in /opt/zenoss is replicated across the two servers.

NOTE: Ensure that you follow the pre-install steps on the secondary server (disabling the firewall, installing the JRE, installing the required packages, and turning on the required services).

Zends Setup

After installing Zends you will need to set it up to work in the replicated environment. Use the following commands on the primary:

service zends stop
mv /opt/zends /opt/zenoss
ln -s /opt/zenoss/zends /opt/zends 
service zends start 

On the secondary server, after installing Zends, log in and remove the /opt/zends directory that was created, then create the symbolic link so that Zends will use the replicated data in /opt/zenoss/zends:

service zends stop
rm -rf /opt/zends
ln -s /opt/zenoss/zends /opt/zends
service zends start

Post-Installation Steps

After the installation is complete you should reset some permissions that were affected by the file system setup:

chown zenoss:zenoss -R /opt/zenoss/perf

Ensure that the MySQL data files are not pushed down to remote collectors (you can skip this step if you don't use remote collectors).

Navigate to $ZENHOME/ZenPacks/ZenPacks.zenoss.DistributedCollector-VERSION.egg/ZenPacks/zenoss/DistributedCollector/conf and add the following exclusion to exfiles:

 - mysql 
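This can also be done idempotently from the shell; a sketch, demonstrated on a scratch file (point EXFILES at the real conf/exfiles path, with VERSION filled in for your install):

```shell
# Demonstrated on a scratch file; set EXFILES to the real exfiles path
# under ZenPacks.zenoss.DistributedCollector-VERSION.egg/.../conf.
EXFILES="${EXFILES:-$(mktemp)}"
# Append the mysql exclusion only if it is not already present.
grep -qx ' - mysql' "$EXFILES" || echo ' - mysql' >> "$EXFILES"
cat "$EXFILES"
```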

Heartbeat Setup

To setup heartbeat to manage your resources you must first install the package. The mechanism to install the package differs depending on your Linux distribution, but the following will work on CentOS 5.7:

    yum install heartbeat

Note: There is a CentOS bug which causes the initial install to fail; when you run the install a second time it completes successfully.

You may then need to install the heartbeat service in case this was not done for you automatically.

chkconfig --add heartbeat 

You must then configure heartbeat to know what your resources are so that it can properly manage them.

Create /etc/ha.d/ha.cf with the following contents:

# Node hostnames
node hostnameA
node hostnameB

# IP addresses of nodes
ucast eth0 <primary node IP>
ucast eth0 <secondary node IP>

# Enable logging
use_logd yes
debug 1

# Don't fail back to the primary node when it comes back up
# NOTE: Set this to "on" if you want Zenoss to automatically migrate back to
# the primary server when it comes back up.
auto_failback off

To secure communication between the cluster nodes, you should create /etc/ha.d/authkeys with the following contents:

auth 1
1 sha1 MySecretClusterPassword 

Heartbeat requires that this file have restrictive permissions set on it. Run the following command to set the proper permissions:

chmod 600 /etc/ha.d/authkeys
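You can verify the resulting mode afterwards; a quick sketch, demonstrated on a scratch file (substitute /etc/ha.d/authkeys on the real system):

```shell
# Demonstrate the permission check on a scratch file; point AUTHKEYS at
# /etc/ha.d/authkeys to check the real file.
AUTHKEYS="${AUTHKEYS:-$(mktemp)}"
chmod 600 "$AUTHKEYS"
# stat -c %a prints the octal mode (GNU coreutils).
mode=$(stat -c %a "$AUTHKEYS")
echo "authkeys mode: $mode"
```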

To use heartbeat you will need a virtual IP set up on your primary system. This IP is the one used in the /etc/ha.d/haresources file that we will cover in the next step. The setup varies by distribution; on CentOS and RHEL the process is as follows:

Copy the contents of /etc/sysconfig/network-scripts/ifcfg-eth0 into a file called /etc/sysconfig/network-scripts/ifcfg-eth0:0.

Change the “DEVICE=” line to be eth0:0 and change the IP to be that of the virtual IP. Restart networking and verify that you can ping the new virtual IP.
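For illustration, assuming a hypothetical virtual IP of 192.0.2.10 on a /24 network, the resulting ifcfg-eth0:0 might look like this:

```
DEVICE=eth0:0
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
```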

Then you should create the /etc/ha.d/haresources file with the following contents:

hostnameA \
    drbddisk::zenhome \
    Filesystem::/dev/drbd0::/opt/zenoss::ext3::defaults \
    drbddisk::zenperf \
    Filesystem::/dev/drbd1::/opt/zenoss/perf::ext3::noatime,data=writeback \
    IPaddr::<virtual IP> \
    zends \
    rabbitmq-server \
    zenoss

Replace <virtual IP> with the virtual IP address you configured above.

Preparing for Cluster Startup

Now that your cluster is fully configured, you must shut down all of the resources to get ready to prime the master, then start the cluster for the first time. This can be done with the following commands.

service zends stop
service zenoss stop
umount /opt/zenoss/perf
umount /opt/zenoss
drbdadm secondary zenhome
drbdadm secondary zenperf
service heartbeat stop

Starting the Cluster

These instructions apply only to the primary cluster node unless otherwise noted. They only need to be performed to start the cluster for the first time; after that, the heartbeat daemon will manage the resources, even across node reboots.

Run the following commands on the primary node to make it the authoritative source for the replicated file systems:

drbdadm -- --overwrite-data-of-peer primary zenhome 
drbdadm -- --overwrite-data-of-peer primary zenperf 

Run the following command on the primary node to start heartbeat and start managing the shared resources:

service heartbeat start

Once you confirm that Zenoss is up and running on the primary node, you can run the same command on the secondary node to have it join the cluster. Your cluster is now up and running. The secondary node will take over in the event of a failure on the primary node.

You should now add heartbeat to the system's init scripts so that it starts when the system reboots. You can do this by running the following command on both the primary and the secondary:

chkconfig --level 2345 heartbeat on

Usage & Operation

Migrating Resources

The best way to manually migrate Zenoss to the currently inactive cluster node is to stop heartbeat on the active node. You can do this by running the following command as the root user on the active node:

service heartbeat stop

If you have auto_failback set to off in your /etc/ha.d/ha.cf, you should start the heartbeat service on this node again as soon as you confirm that Zenoss is running on the other node. If you have auto_failback set to on, you should only start the heartbeat service again when you want Zenoss to be migrated back to this node.

Checking the Cluster's Status

There are a number of commands that you should be aware of that allow you to check on the status of your cluster and the nodes and resources that make it up.

To check on the status of the DRBD replicated file systems you can run the following command:

service drbd status

On the primary node of an active cluster you should expect to see the following results from this command. The important columns are: cs=Connection State, st=State, ds=Data State.

m:res      cs         st                 ds                 p  mounted           fstype
0:zenhome  Connected  Primary/Secondary  UpToDate/UpToDate  C  /opt/zenoss       ext3
1:zenperf  Connected  Primary/Secondary  UpToDate/UpToDate  C  /opt/zenoss/perf  ext3
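If you want to monitor this from a script, the following sketch parses status lines in the format shown above. The column layout can differ between DRBD versions, so treat it as an assumption to verify against your own output.

```shell
# drbd_ok: succeed when every numbered resource line on stdin reports
# cs=Connected and ds=UpToDate/UpToDate.
drbd_ok() {
    awk '
        $1 ~ /^[0-9]+:/ {
            if ($2 != "Connected" || $4 != "UpToDate/UpToDate") bad = 1
        }
        END { exit bad }'
}
# Example: service drbd status | drbd_ok && echo "replication healthy"
```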

You can run a similar command to check on the general health of the heartbeat service:

service heartbeat status

You can use the cl_status tool to get more detailed information about the current state of the cluster. Here are some examples of its usage:

[root@hostnameA ~]# cl_status hbstatus

Heartbeat is running on this machine.

[root@hostnameA ~]# cl_status listnodes

hostnameA
hostnameB
[root@hostnameA ~]# cl_status nodestatus hostnameB

hostnameB dead

[root@hostnameA ~]# cl_status nodestatus hostnameA

hostnameA active

[root@hostnameA ~]# cl_status rscstatus

all

Troubleshooting

This section outlines some common failure modes and the steps required to correct them.

DRBD Split-Brain

It is possible for the replicated file systems to get into a state where neither node knows exactly which one has the authoritative source of data. This is known as a split-brain. You can resolve it by picking the node with the older, invalid data and running the following commands on it:

drbdadm secondary zenhome
drbdadm -- --discard-my-data connect zenhome
drbdadm secondary zenperf
drbdadm -- --discard-my-data connect zenperf

Then run the following commands on the node with the newer, valid data:

drbdadm connect zenhome
drbdadm connect zenperf

MySQL Database does not start after a failover

If the MySQL database does not start after performing an HA failover, /var/log/mysqld.log may show entries like the following:

090825 12:11:08 InnoDB: Starting shutdown...
090825 12:11:11 InnoDB: Shutdown completed; log sequence number 0 440451886
090825 12:11:11 [Note] /usr/libexec/mysqld: Shutdown complete
090825 12:11:11 mysqld ended
100330 22:52:21 mysqld started
InnoDB: Error: log file ./ib_logfile0 is of different size 0 524288000 bytes
InnoDB: than specified in the .cnf file 0 5242880 bytes!
100330 22:52:21 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.0.45'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
100330 22:52:50 [ERROR] /usr/libexec/mysqld: Incorrect information in file: './events/heartbeat.frm'

As noted in a thread on the MySQL forums, open the my.cnf file on the system and add the following line so that the configured size matches the existing InnoDB log files:

innodb_log_file_size = 524288000

Updating the hub host to point at a floating IP or hostname

As the zenoss user, at a command line, enter the zendmd Python interpreter:

>>> dmd.Monitors.Hub.localhost.hostname = "MY_FQDN OR IP" 
>>> dmd.Monitors.Hub.localhost._isLocalHost = False
>>> commit() 

