Monitoring Service and Host Clusters


Introduction

Several people have asked how to go about monitoring clusters of hosts or services, so I decided to write up a little documentation on how to do this. Its fairly straightforward, so hopefully you find things easy to understand...

First off, we need to define what we mean by a "cluster". The simplest way to understand this is with an example. Let's say that your organization has five hosts which provide redundant DNS services to your organization. If one of them fails, its not a major catastrophe because the remaining servers will continue to provide name resolution services. If you're concerned with monitoring the availability of DNS service to your organization, you will want to monitor five DNS servers. This is what I consider to be a service cluster. The service cluster consists of five separate DNS services that you are monitoring. Although you do want to monitor each individual service, your main concern is with the overall status of the DNS service cluster, rather than the availability of any one particular service.

If your organization has a group of hosts that provide a high-availability (clustering) solution, I would consider those to be a host cluster. If one particular host fails, another will step in to take over all the duties of the failed server. As a side note, check out the High-Availability Linux Project for information on providing host and service redundancy with Linux.

Plan of Attack

There are several ways you could potentially monitor service or host clusters. I'll describe the method that I believe to be the easiest. Monitoring service or host clusters involves two things:

Monitoring individual host or service cluster elements is easier than you think. In fact, you're probably already doing it. For service clusters, just make sure that you are monitoring each service element of the cluster. If you've got a cluster of five DNS servers, make sure you have five separate service definitions (probably using the check_dns plugin). For host clusters, make sure you have configured appropriate host definitions for each member of the cluster (you'll also have to define at least one service to be monitored for each of the hosts). Important: You're going to want to disable notifications for the individual cluster elements (host or service definitions). Even though no notifications will be sent about the individual elements, you'll still get a visual display of the individual host or service status in the status CGI. This will be useful for pinpointing the source of problems within the cluster in the future.

Monitoring the overall cluster can be done by using the previously cached results of cluster elements. Although you could re-check all elements of the cluster to determine the cluster's status, why waste bandwidth and resources when you already have the results cached? Where are the results cached? Cached results for cluster elements can be found in the status file (assuming you are monitoring each element). The check_cluster plugin is designed specifically for checking cached host and service states in the status file. Important: Although you didn't enable notifications for individual elements of the cluster, you will want them enabled for the overall cluster status check.

Using the check_cluster Plugin

The check_cluster plugin is designed to check the overall status of a host or service cluster. It works by checking the cached status information of individual host or service cluster elements in the status file.

More to come... The check_cluster plugin can temporarily be obtained from http://www.nagios.org/download/alpha.

Monitoring Service Clusters

First off, you're going to have to define a new service for monitoring the cluster. This service will perform the check of the overall status of the cluster. You are probably going to want to have notifications enabled for this service so you know when there are problems that need to be looked at. You probably don't care so much about the status of any one of the services that are members of the cluster, so you can disable notifications in those those service definitions.

Okay, let's assume that you have a check_service_cluster command defined as follows:

define command{
	command_name	check_service_cluster
	command_line	/usr/local/nagios/libexec/check_cluster --service /usr/local/nagios/var/status.log $ARG1$ $ARG2$ < $ARG3$ 
	}

Let's say you have five services that are members of the service cluster. If you want Nagios to generate a warning alert if two or more services in the cluster and in a non-ok state or a critical alert if three or more are in a non-ok state, the <check_command> argument of the service you define to monitor the cluster looks something like this:

check_service_cluster!2!3!/usr/local/nagios/etc/servicecluster.cfg

The $ARG3$ macro will be replaced with /usr/local/nagios/etc/servicecluster.cfg when the check is made. Since this is the file from which the check_cluster plugin will read the names of cluster members, you'll need to create that file and add the services that are members (one per line). The format of a service entry is the short name of the host the service is associated with, followed by a semi-colon, and then the service description. An example of the file contents would be as follows:

host1;DNS Service
host2;DNS Service
host3;DNS Service
host4;DNS Service
host5;DNS Service
host6;DNS Service

Monitoring Host Clusters

Monitoring host clusters is very similiar to monitoring service clusters. Obviously, the main difference is that the cluster members are hosts and not services. In order to monitor the status of a host cluster, you must define a service that uses the check_cluster plugin. The service should not be associated with any of the hosts in the cluster, as this will cause problems with notifications for the cluster if that host goes down. A good idea might be to associate the service with the host that Nagios is running on. After all, if the host that Nagios is running on goes down, then Nagios isn't running anymore, so there isn't anything you can do as far as monitoring (unless you've setup redundant monitoring hosts)...

Anyway, let's assume that you have a check_host_cluster command defined as follows:

define command{
	command_name	check_host_cluster
	command_line	/usr/local/nagios/libexec/check_cluster --host $ARG1$ $ARG2$ /usr/local/nagios/var/status.log < $ARG3$
	}

Let's say you have six hosts in the host cluster. If you want Nagios to generate a warning alert if two or more hosts in the cluster are not up or a critical alert if four or more hosts are not up, the <check_command> argument of the service you define to monitor the cluster looks something like this:

check_host_cluster!2!4!/usr/local/nagios/etc/hostcluster.cfg

The $ARG3$ macro will be replaced with /usr/local/nagios/etc/hostcluster.cfg when the check is made. Since this is the file from which the check_cluster plugin will read the names of cluster members, you'll need to create that file and add the short names of all hosts (as they were defined in your host definitions) that are members (one per line). An example of the file contents would be as follows:

host1
host2
host3
host4
host5
host6

That's it! Nagios will periodically check the status of the host cluster and send notifications to you when its status is degraded (assuming you've enabled notification for the service). Note that for thehost definitions of each cluster member, you will most likely want to disable notifications when the host goes down . Remeber that you don't care as much about the status of any individual host as you do the overall status of the cluster. Depending on your network layout and what you're trying to accomplish, you may wish to leave notifications for unreachable states enabled for the host definitions.