Determining Status and Reachability of Network Hosts


Monitoring Services on Down or Unreachable Hosts

The main purpose of Nagios is to monitor services that run on or are provided by physical hosts or devices on your network. It should be obvious that if a host or device on your network goes down, all services that it offers will also go down with it. Similarly, if a host becomes unreachable, Nagios will not be able to monitor the services associated with that host.

Nagios recognizes this fact and attempts to check for such a scenario when there are problems with a service. Whenever a service check results in a non-OK status level, Nagios will attempt to check and see if the host that the service is running on is "alive". Typically this is done by pinging the host and seeing if any response is received. If the host check command returns a non-OK state, Nagios assumes that there is a problem with the host. In this situation Nagios will "silence" all potential alerts for services running on the host and just notify the appropriate contacts that the host is down or unreachable. If the host check command returns an OK state, Nagios will recognize that the host is alive and will send out an alert for the service that is misbehaving.
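The decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Nagios source code: the function name, the message strings and the OK constant (0, following the usual plugin return-code convention) are assumptions made for the example.

```python
from typing import Callable, List

OK = 0  # assumed: plugin-style return code meaning "OK"


def alerts_for_failed_service(host: str, service: str,
                              host_check: Callable[[str], int]) -> List[str]:
    """Return the alerts to send after `service` on `host` returned non-OK."""
    if host_check(host) != OK:
        # Host problem: silence the service alert and report only the host.
        return [f"HOST PROBLEM: {host} is down or unreachable"]
    # The host is alive, so the service itself is misbehaving.
    return [f"SERVICE PROBLEM: {service} on {host}"]
```

Here host_check stands in for whatever host check command is configured (typically a ping); the hypothetical caller passes in a function that returns its status code.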

Local Hosts

"Local" hosts are hosts that reside on the same network segment as the host running Nagios - no routers or firewalls lie between them. Figure 1 shows an example network layout. Host A is running Nagios and monitoring all other hosts and routers depicted in the diagram. Hosts B, C, D, E and F are all considered to be "local" hosts in relation to host A.

The <parents> option in the host definition for a "local" host should be left blank, as local hosts have no dependencies or "parents" - that's why they're local.

Monitoring Local Hosts

Checking hosts that are on your local network is fairly simple. Short of someone accidentally (or intentionally) unplugging the network cable from one of your hosts, there isn't too much that can go wrong as far as checking network connectivity is concerned. There are no routers or external networks between the host doing the monitoring and the other hosts on the local network.

If Nagios needs to check to see if a local host is "alive" it will simply run the host check command for that host. If the command returns an OK state, Nagios assumes the host is up. If the command returns any other status level, Nagios will assume the host is down.

Figure 1.

Remote Hosts

"Remote" hosts are hosts that reside on a different network segment than the host running Nagios. In the figure above, hosts G, H, I, J, K, L and M are all considered to be "remote" hosts in relation to host A.

Notice that some hosts are "farther away" than others. Hosts H, I and J are one hop farther away from host A than host G (the router) is. From this observation we can construct a host dependency tree as shown below in Figure 2. This tree diagram will help us in deciding how to configure each host in Nagios.

The <parents> option in the host definition for a "remote" host should be the short name(s) of the host(s) directly above it in the tree diagram (as shown below). For example, the parent host for host H would be host G. The parent host for host G is host F. Host F has no parent host, since it is on the same network segment as host A - it is a "local" host.

Figure 2.
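As a rough illustration, the parent relationships named in the text can be written down as a simple mapping. The sketch below uses Python rather than Nagios configuration syntax; the names PARENTS and path_to_top are invented for the example, only the three hosts whose parents are stated above are included, and the mapping is assumed to contain no cycles.

```python
from typing import Dict, List

# Parent relationships stated in the text; other hosts are left out
# because their parents are not spelled out.
PARENTS: Dict[str, List[str]] = {
    "F": [],     # local host: parents left blank
    "G": ["F"],  # router G sits directly below host F in the tree
    "H": ["G"],  # host H sits directly below router G
}


def path_to_top(host: str, parents: Dict[str, List[str]]) -> List[str]:
    """Follow the first parent of each host until a host with no parents."""
    path = [host]
    while parents.get(path[-1]):
        path.append(parents[path[-1]][0])
    return path
```

Calling path_to_top("H", PARENTS) yields the chain from host H up to the local host F, passing through G.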

Monitoring Remote Hosts

Checking the status of remote hosts is a bit more complicated than for local hosts. If Nagios cannot monitor services on a remote host, it needs to determine whether the remote host is down or whether it is unreachable. Luckily, the <parents> option allows Nagios to do this.

If a host check command for a remote host returns a non-OK state, Nagios will "walk" the dependency tree (as shown in the figure above) until it reaches the top (or until a parent host check results in an OK state). By doing this, Nagios is able to determine if a service problem is the result of a down host, a down network link, or just a plain old service failure.
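One way to picture the tree walk is the recursive sketch below. It is a simplified model under stated assumptions, not the actual Nagios algorithm: classify_host, the parents mapping and the is_up callback are hypothetical names, and the rule used (a failed host with at least one live parent is DOWN, a failed host with no live parent is UNREACHABLE) is an interpretation of the paragraph above.

```python
from typing import Callable, Dict, List


def classify_host(host: str,
                  parents: Dict[str, List[str]],
                  is_up: Callable[[str], bool]) -> str:
    """Classify `host` as UP, DOWN or UNREACHABLE by walking up its parents."""
    if is_up(host):
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Top of the tree (a local host): nothing upstream to blame.
        return "DOWN"
    for parent in host_parents:
        # The walk stops as soon as some parent checks out OK.
        if classify_host(parent, parents, is_up) == "UP":
            return "DOWN"
    # No parent is up, so the problem lies somewhere upstream.
    return "UNREACHABLE"


# Parent relationships named in the text (H -> G -> F).
parents = {"F": [], "G": ["F"], "H": ["G"]}

# Example: router G has failed, so host H cannot be reached through it.
alive = {"F"}
result = classify_host("H", parents, lambda h: h in alive)
```

In this example G is classified as DOWN (its parent F is up) and H as UNREACHABLE, which corresponds to a problem upstream of H rather than on H itself.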

DOWN vs. UNREACHABLE Notification Types

I get lots of email from people asking why Nagios is sending out notifications about hosts that are unreachable. The answer is that you configured it to do so. If you want to disable UNREACHABLE notifications for hosts, modify the notification_options argument of your host definitions so that it does not include the u (unreachable) option. More information can be found in this FAQ.
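The effect of the notification_options value can be modelled with a small Python sketch. The function name should_notify is invented for illustration, and the mapping assumes the letters d (down), u (unreachable) and r (recovery) in a comma-separated list; treat it as a model of the filtering rather than as Nagios code.

```python
def should_notify(state: str, notification_options: str) -> bool:
    """Return True if a host in `state` would trigger a notification."""
    # Assumed mapping of host states to option letters (r = recovery).
    letters = {"DOWN": "d", "UNREACHABLE": "u", "UP": "r"}
    enabled = {opt.strip() for opt in notification_options.split(",")}
    return letters.get(state) in enabled
```

With a value such as d,r the function returns False for UNREACHABLE and True for DOWN, which mirrors dropping the u option from a host definition.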