Oleksiy Gayda's Blog: TechTip: Understanding the availability reports in Nagios Core

Nagios is an incredibly powerful tool for monitoring all and any network-connected equipment (and I do mean all and any - if it has a network interface and provides any sort of information on that interface, it can be queried, monitored and reported on by Nagios through a number of readily available or custom-written scripts). However, the GUI in Nagios Core (the free community version of the product) could use some serious usability and readability improvements.

One of the most confusing yet most important (due to the high likelihood of management visibility) components of Nagios Core is availability reporting. I'm not going to go into detail on how to generate an availability report since it is pretty straightforward (log in, click "Availability" under the "Reports" menu, select the report type and qualifier, pick some dates and you're done). You will end up looking at a report showing your selected hosts and four status percentage columns.

This is what the columns mean:

Host - straightforward, name of the host in Nagios
% Time Up - How much of the time the host was actually up (this is your standard availability value)
% Time Down - How much of the time the host was actually down
% Time Unreachable - How much time the host was unreachable (different from "Down" in that this could be caused by a network outage resulting in the host not being reachable due to unavailability of its parent hosts, but not actually being down itself).
% Time Undetermined - Percentage of time when Nagios did not have complete log data on the device and therefore cannot provide any statistic on its uptime for that timeframe (can be caused by an addition of a new device, for example).

Time Undetermined and Numbers in Brackets

Also with regard to the "% Time Undetermined" column, you will notice that Nagios shows two values for each of the other metrics, one "normally" and one in brackets - 86.200% (86.200%). If "% Time Undetermined" is zero, these two numbers will always match. However, if Nagios did not have log data for part of the specified time period, you will see that reflected in the other columns. The "normal" number is a percentage of the whole report period, undetermined time included, while the number in brackets is a percentage of only the time for which Nagios actually has data. So for example, with recently added devices, the number in brackets will show the actual system uptime, while the "normal" number will be dragged down as if the device had not been up before it was added. Obviously in a situation like this, the number you want to show to your boss is the one in brackets.
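To make the arithmetic concrete, here is a small sketch of how the two numbers relate. It is not Nagios code, just an illustration under the assumption that the "normal" value is a share of the whole report period and the bracketed value is a share of the known (not undetermined) time. The function name and the sample durations are made up for the example.

```python
def availability_percentages(up, down, unreachable, undetermined):
    """Return {state: (pct_of_total, pct_of_known)} for durations in seconds.

    pct_of_total uses the whole report period as the denominator;
    pct_of_known leaves the undetermined time out of the denominator.
    """
    total = up + down + unreachable + undetermined
    known = total - undetermined
    result = {}
    for name, seconds in (("up", up), ("down", down), ("unreachable", unreachable)):
        pct_total = 100.0 * seconds / total if total else 0.0
        pct_known = 100.0 * seconds / known if known else 0.0
        result[name] = (pct_total, pct_known)
    return result


# Hypothetical 10-day report for a host that was only added on day 6:
# 5 days without log data, then 4 days up and 1 day down.
DAY = 86400
report = availability_percentages(
    up=4 * DAY, down=1 * DAY, unreachable=0, undetermined=5 * DAY
)
for state, (pct_total, pct_known) in report.items():
    # Same layout as the report columns: normal value, then the one in brackets
    print(f"% Time {state.title()}: {pct_total:.3f}% ({pct_known:.3f}%)")
```

In this made-up case the host shows 40% uptime against the full period but 80% against the time Nagios actually knew about it, which is the gap the brackets are there to expose.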

Scheduled Downtime

Another little-publicized caveat of Nagios availability reporting is that it counts any "Scheduled Downtime" within the report period as downtime. This can be very misleading and upsetting to your management: you schedule system maintenance and work with users to make sure that they are not impacted by the work you do, but according to the availability report - you had downtime!

This unfortunate "feature" is "by design" and there is no way to change or disable it. The only workaround available at the time of writing (with Nagios Core at version 3.3.1) is to disable all checks on the host before bringing it down for the scheduled downtime.

You can still use the scheduled downtime for record keeping and communication with your team, but on its own it has no effect on your availability report. Also, don't forget to re-enable the checks when you're done with the host!
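If you would rather script the disable/re-enable step than click through the web interface, one option is the Nagios external command file. The sketch below is only an illustration of that approach, not something I have described above: it builds DISABLE_HOST_CHECK / DISABLE_HOST_SVC_CHECKS lines (and their ENABLE counterparts) in the "[timestamp] COMMAND;host" format. The command file path is an assumption, so check the command_file setting in your nagios.cfg, and "web01" is a made-up host name.

```python
import time

# Assumed location; the real path is whatever command_file points to in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def check_toggle_commands(host, enable, now=None):
    """Build external-command lines that enable or disable checks for one host."""
    timestamp = int(time.time() if now is None else now)
    prefix = "ENABLE" if enable else "DISABLE"
    return [
        f"[{timestamp}] {prefix}_HOST_CHECK;{host}",       # the host check itself
        f"[{timestamp}] {prefix}_HOST_SVC_CHECKS;{host}",  # all services on the host
    ]


def submit(lines, command_file=COMMAND_FILE):
    """Write the command lines to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")


# Preview only: print what would be sent for a hypothetical host before maintenance.
for line in check_toggle_commands("web01", enable=False, now=1300000000):
    print(line)
```

Calling submit() with the enable=False lines before the maintenance and the enable=True lines afterwards would cover both halves of the workaround, and pairing them in one script makes it harder to forget the re-enable step.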
