System monitoring not only alerts us when certain parameters go out of tolerance or when malfunctions occur in our infrastructure, but also helps us analyze resource trends and gives clues about current system problems.
Traditionally, Nagios has been the de facto industry standard for IT infrastructure monitoring thanks to, among other merits, its flexible notification system, simple plugin design and Open Source nature. However, the power and flexibility Nagios offers come at a price: a steep learning curve and a complex setup and configuration.
Nowadays, the Open Monitoring Distribution (OMD) comes to the rescue by offering a pre-packaged Nagios system for a variety of GNU/Linux distributions. Building on Nagios and the Check_MK plugin ecosystem, it allows us, for example, to deploy a fairly complete monitoring solution for a medium-sized HPC cluster, with notifications, trend visualization, etc., in a matter of hours.
In this talk we will take a look at the basics of Nagios monitoring (types of monitoring, notifications, plugin system, etc.), its advantages and main problems, and how the latter are solved, or at least greatly mitigated, by Check_MK and OMD. We will also examine how these three layers interact.