apm digest
What happens when an IT outage is followed by hundreds of Nagios alerts?
- Important alerts fall through the cracks
- False alarms divert your attention from real issues
- Seeing the big picture becomes impossible
During these alert floods, alerting becomes practically useless. Most operations teams go as far as ignoring their Nagios alerts entirely during major outages. Instead, they focus on the one alert that appears to be most urgent at that moment.
Frustrated with noisy alerts, many people these days are shopping for an alternative to Nagios. They want something that gives them deep, granular visibility across their infrastructure without the noise. However, switching costs are very high, and most Nagios alternatives don’t offer a fundamental improvement in the trade-off between granular visibility and noisy alerts.
But it is actually possible to eliminate alert floods altogether, without ripping out Nagios. More and more companies are now using alert correlation on top of Nagios to fight alert overload and boost their production health.
The roots of the problem actually lie in a positive aspect of alerting: automation. It is unthinkable to manually parse through hundreds of thousands of metrics looking for unhealthy values. Nobody expects their infrastructure operators to stare at charts and point out things such as “CPU load on host #472 is awfully high” or “access time to our billing app from France appears to be a bit off.”
Alert Correlation to the Rescue
What is Alert Correlation and how can it help? The best way to understand it is to look at an example.
Consider a MySQL cluster with 25 hosts. Some of these hosts have been experiencing high page-fault rates, and a few others have reported low free memory. In 30 minutes, we received more than 20 individual alerts. Our Nagios dashboard now looks like a circus. Our email inbox looks even worse.
There is a right way to look at alerts. In this particular case, we would have preferred to see just a single incident. That incident would group together all of the cluster’s memory and page-fault alerts, allowing us to stay in control even during the alert flood.
Furthermore, by correlating these alerts together, we can easily distinguish between alerts belonging to this incident and other, similar alerts, such as storage issues on the MySQL nodes or a global connectivity issue experienced by the datacenter. Alerts such as these often drown in the alert flood.
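To make the idea concrete, here is a minimal sketch of alert correlation in Python. It is not BigPanda’s algorithm, only an illustration under simple assumptions: alerts that share a cluster and a broad category, and that arrive within 30 minutes of an incident’s first alert, are grouped into one incident. The `Alert` fields, the `CATEGORY` mapping, the host names and the `correlate` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    cluster: str
    check: str   # hypothetical check name, e.g. "page_faults"
    minute: int  # minutes since the start of the observation period


# Hypothetical mapping from individual checks to a broader category.
CATEGORY = {
    "page_faults": "memory",
    "free_memory": "memory",
    "disk_space": "storage",
}


def correlate(alerts, window_minutes=30):
    """Group alerts by (cluster, category) within a time window.

    An alert joins the open incident for its key if it arrives no more
    than `window_minutes` after that incident's first alert; otherwise
    it starts a new incident.
    """
    incidents = []
    open_incident = {}  # (cluster, category) -> most recent incident
    for alert in sorted(alerts, key=lambda a: a.minute):
        key = (alert.cluster, CATEGORY.get(alert.check, alert.check))
        current = open_incident.get(key)
        if current is None or alert.minute - current["start"] > window_minutes:
            current = {"key": key, "start": alert.minute, "alerts": []}
            open_incident[key] = current
            incidents.append(current)
        current["alerts"].append(alert)
    return incidents


alerts = [
    Alert("mysql-01", "mysql", "page_faults", 0),
    Alert("mysql-02", "mysql", "free_memory", 5),
    Alert("mysql-03", "mysql", "page_faults", 12),
    Alert("mysql-01", "mysql", "disk_space", 14),
]

for incident in correlate(alerts):
    print(incident["key"], len(incident["alerts"]))
```

With this sample data, the three memory-related alerts collapse into a single incident, while the storage alert on the same cluster stays separate instead of drowning in the flood. A real correlation engine would likely use richer signals than a fixed category table and a fixed time window.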
Among the benefits of using BigPanda are:
Efficiency – BigPanda’s algorithm is capable of reducing your alert load by up to 99%, while remaining highly accurate.
Custom Rules – BigPanda allows you to configure custom rules for special correlation use cases.
Full Stack – In addition to Nagios, BigPanda can also consume alerts from other monitoring tools, such as New Relic, Splunk, Pingdom and many others.
Learn more and try it free at bigpanda.io.