Overview

The challenge in monitoring an environment like a Eucalyptus cluster is that it is always changing. Virtual machines are created and destroyed all the time. While a virtual machine is running, we want to monitor it. Once it no longer exists, we want to stop monitoring it. Above all, we don't want to keep editing the configuration of our monitoring system by hand to add and remove these hosts and their associated checks. This is where OMD shines, because we can combine Nagios eventhandlers with the ability of Check_MK to (re-)inventory hosts, rebuild the Nagios object configuration, and reload Nagios. The result is a dynamic system that always knows what to monitor and what not to monitor.

'Check_MK inventory' Eventhandler

  • The first step is to set up an eventhandler that can respond to a situation in which the service check "Check_MK inventory" discovers a new service.
    • ( $USER4$ is a Nagios user macro defined in $OMD_ROOT/etc/nagios/resource.cfg; it holds the value of $OMD_ROOT itself )
    • Nagios has lots of built-in Macros you can use inside your Nagios configuration.
  • Check out our example config file from code.renci.org SVN:
  • cmk_reinventory.sh SVN source:
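
As a rough illustration of what such an eventhandler has to decide, here is a minimal Python sketch. It is not the cmk_reinventory.sh script itself. It assumes that Nagios passes the host name, $SERVICESTATE$ and $SERVICESTATETYPE$ as arguments, that "Check_MK inventory" goes to WARNING or CRITICAL when it finds unchecked services, and that acting only on HARD states is wanted. The function names are hypothetical; `cmk -I` and `cmk -O` are the Check_MK commands for inventorying a host and for regenerating the Nagios configuration and reloading Nagios.

```python
import subprocess


def reinventory_commands(hostname, state, state_type):
    """Return the cmk commands to run for one 'Check_MK inventory' event.

    Assumption: only a HARD WARNING/CRITICAL state means that new,
    unchecked services were found on the host.
    """
    if state not in ("WARNING", "CRITICAL") or state_type != "HARD":
        return []
    return [
        ["cmk", "-I", hostname],  # inventory: pick up the new services
        ["cmk", "-O"],            # rebuild Nagios config and reload Nagios
    ]


def handle_event(hostname, state, state_type, run=subprocess.check_call):
    """Run the commands for this event and return how many were run."""
    commands = reinventory_commands(hostname, state, state_type)
    for command in commands:
        run(command)
    return len(commands)
```

Passing a different `run` callable (for example `list.append`) makes it possible to dry-run the handler without touching the monitoring site.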

Adding VMs

  • ${OMD_ROOT}/local/bin/add_vm.sh
  • SVN source:
  • For this action, I chose to have the cmk_reinventory.sh eventhandler call this script directly. I originally wanted it to be an eventhandler itself, but there were several problems with that design -- most acutely, the realization that a new service changing from an unchecked state to an OK state does not trigger a Nagios event.
  • This script relies on the cmk_reinventory.sh script in conjunction with the 'qemu' check_mk check (which itself requires the '/usr/lib/check_mk_agent/plugins/mk_qemu' plugin to be installed on the Eucalyptus worker nodes). Because each host's autochecks are listed in ${OMD_ROOT}/var/check_mk/autochecks, we can parse that information for the names of KVM virtual machines and add them to Check_MK (testing, of course, to ensure they don't already exist).
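
The parsing step described above can be sketched in Python as follows. This is an illustration under assumptions, not the add_vm.sh source: it assumes that each autochecks entry sits on one line in the form `("host", "checktype", 'item', params),` and that the item of a 'qemu' check is the VM name. The function names are hypothetical.

```python
import re

# Assumed entry layout: ("host", "checktype", 'item', params),
ENTRY = re.compile(
    r"""\(\s*["']([^"']+)["']\s*,\s*["']([^"']+)["']\s*,\s*["']([^"']+)["']\s*,"""
)


def qemu_vm_names(autochecks_text):
    """Return the items of all 'qemu' autochecks found in the text."""
    names = []
    for line in autochecks_text.splitlines():
        match = ENTRY.search(line)
        if match and match.group(2) == "qemu":
            names.append(match.group(3))
    return names


def vms_to_add(autochecks_text, known_hosts):
    """Return VM names that are not yet configured as Check_MK hosts."""
    known = set(known_hosts)
    new = []
    for name in qemu_vm_names(autochecks_text):
        if name not in known and name not in new:
            new.append(name)
    return new
```

A regular expression is used instead of evaluating the file because autochecks parameters may refer to Check_MK variable names that are not defined outside Check_MK.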

Removing VMs

  • In this setup, the Nagios host check is a ping check, and the result is UP or DOWN depending on whether the host could be reached.
  • We want to define an eventhandler that is triggered by the DOWN state of a host, but only for hosts with the Check_MK tag 'vm'.
  • If the host has a 'vm' tag, is in a DOWN state, and is no longer listed as 'running' or 'pending' by euca-describe-instances, then we want to remove it from Check_MK's hosts.mk and ipaddresses.mk files and reload Check_MK and Nagios.
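# Check_MK configuration (e.g. main.mk): define the Nagios command
# that runs del_vm.sh with the host name and host state.
extra_nagios_conf += r"""
define command {
    command_name    del_vm
    command_line    $USER4$/local/bin/del_vm.sh $HOSTNAME$ $HOSTSTATE$
}
"""
# Use del_vm as the host eventhandler, but only for hosts tagged 'vm'.
extra_host_conf["event_handler"] = [
	( "del_vm", [ "vm" ], ALL_HOSTS ),
]
# Enable eventhandlers for those same hosts.
extra_host_conf["event_handler_enabled"] = [
	( "1", [ "vm" ], ALL_HOSTS ),
]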
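
The decision that del_vm.sh has to make can be sketched in Python as below. This is a sketch under assumptions, not the script itself: it assumes that the Check_MK host name of a VM equals its Eucalyptus instance ID, and that euca-describe-instances prints one whitespace-separated `INSTANCE` line per instance with the instance ID in the second field and the state somewhere on the line. The 'vm' tag is not tested here because the rules above already restrict the eventhandler to tagged hosts. The function names are hypothetical.

```python
ACTIVE_STATES = ("running", "pending")


def active_instances(describe_output):
    """Return the IDs of instances listed as running or pending."""
    active = set()
    for line in describe_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "INSTANCE":
            continue
        # Look for the state anywhere on the line rather than at a fixed
        # column, since empty fields can shift the column positions.
        if any(state in fields for state in ACTIVE_STATES):
            active.add(fields[1])
    return active


def should_remove(hostname, hoststate, describe_output):
    """True if the host is DOWN and no longer an active instance."""
    if hoststate != "DOWN":
        return False
    return hostname not in active_instances(describe_output)
```

Checking euca-describe-instances in addition to the DOWN state keeps a VM that is only temporarily unreachable from being dropped from monitoring.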

Cleaning after deleted VMs

  • The thought behind this script is to remove a 'qemu' autocheck from a Eucalyptus worker node after a VM instance has been terminated with the euca-terminate-instances command. When the VM goes away, its 'qemu' check is orphaned (left in an 'UNKNOWN' state) until the host is re-inventoried. It is the opposite of what cmk_reinventory.sh and add_vm.sh do.
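
One way to picture the cleanup is a filter over the worker node's autochecks file, sketched here in Python. It assumes the same one-entry-per-line layout `("host", "qemu", 'vmname', params),` as a working hypothesis, and the function name is hypothetical. Writing the result back to the file and reloading (for example with `cmk -O`) is left to the caller.

```python
def drop_qemu_autocheck(autochecks_text, vm_name):
    """Return the autochecks text without the 'qemu' entry for vm_name."""
    kept = []
    for line in autochecks_text.splitlines(True):  # keep line endings
        is_qemu = '"qemu"' in line or "'qemu'" in line
        names_vm = ('"%s"' % vm_name) in line or ("'%s'" % vm_name) in line
        if is_qemu and names_vm:
            continue  # orphaned check for the terminated VM
        kept.append(line)
    return "".join(kept)
```

Entries for other VMs and for other check types on the same worker node are left untouched, so no full re-inventory of the node is needed.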