If you work in IT Ops, you’ve probably been on the receiving end of a tsunami of Nagios alerts. It’s not pleasant.
What happens when an IT outage is followed by hundreds of Nagios alerts?
- Important alerts fall through the cracks
- False alarms divert your attention from real issues
- Seeing the big picture becomes impossible
During these alert floods, alerting becomes practically useless. Most operations teams go as far as ignoring their Nagios alerts entirely during major outages. Instead, they focus on the one alert that appears to be most urgent at that moment.
Frustrated with noisy alerts, many people these days are shopping for an alternative to Nagios: something that will give them deep and granular visibility across their infrastructure without so much noise. However, the switching costs are very high, and most Nagios alternatives don’t offer a fundamental improvement in the trade-off between granular visibility and noisy alerts.
But it is actually possible to eliminate alert floods altogether without ripping out Nagios. More and more companies are now using alert correlation on top of Nagios to fight alert overload and boost their production health.
Too Much of a Good Thing
Why do alert floods happen at all? And why are things only getting worse every year?
The roots of the problem actually lie in a positive aspect of alerting: automation. It is unthinkable to manually parse through hundreds of thousands of metrics looking for unhealthy values. Nobody expects their infrastructure operators to stare at charts and point out things such as “CPU load on host #472 is awfully high” or “access time to our billing app from France appears to be a bit off.”
By configuring thresholds, we delegate this work to Nagios. Nagios goes through all of our checks, looking for metrics that pass thresholds, and alerts us when necessary. As stated, this is a good thing that helps us scale effectively.
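As a rough illustration of what a threshold check does, here is a minimal Python sketch. It is not Nagios code: the `evaluate` function, the host names, and the threshold values are all hypothetical. The only borrowed detail is the numbering of states, which follows the Nagios plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
from enum import IntEnum


class Status(IntEnum):
    # Numeric values mirror the Nagios plugin return-code convention.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def evaluate(value: float, warning: float, critical: float) -> Status:
    """Map a metric value to a status, assuming higher values are worse."""
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.OK


# Hypothetical CPU load samples keyed by host name.
samples = {"host-471": 0.8, "host-472": 9.5, "host-473": 4.2}

# Evaluate every sample, then keep only the hosts that crossed a threshold.
statuses = {host: evaluate(load, warning=4.0, critical=8.0)
            for host, load in samples.items()}
alerts = {host: status for host, status in statuses.items()
          if status is not Status.OK}

for host, status in alerts.items():
    print(f"{host}: {status.name}")
```

Every host that crosses a threshold produces its own alert, which is exactly the behavior that scales well in quiet times and floods the dashboard during an outage.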
Why, then, does such a big time-saver suddenly become a productivity-killer during major outages? The answer lies in the architecture of modern applications:
- Server Fleets – Modern applications rely on distributed computing. We load-balance services, shard databases, and cluster queues. As a result, a single application is no longer tied to one powerful server. Instead, it is supported by tens, sometimes even hundreds or thousands, of instances. When the application or service experiences an issue, the problem shows up on many of the supporting instances. What we get is hundreds of alerts, all representing just one issue.
- Microservices – In the past decade, the idea of monolithic software was retired in favor of a service-oriented architecture. One business application now consists of many single-purpose services communicating with each other. This has allowed companies to move much faster by improving scalability, cohesion, and maintainability. Unfortunately, this also means that a single problem is likely to impact many services. An issue in one service spreads to adjacent services, gradually creating a cascade of alerts from dozens of services.
- Agile Development – Engineering teams are more driven by business use-cases than ever before. We see fewer and fewer year-long, technically-rich projects developed in isolation. Developers opt for shorter iterations, guided by ongoing customer feedback. Unfortunately, this impacts our ability to maintain accurate and up-to-date monitoring configurations. By the time we are done configuring thresholds and hierarchies, our application has already changed. Over time, we accumulate large quantities of meaningless checks or checks with dated thresholds.
Of course, the evolution described above is a great thing. The goal of alert correlation is never to undo or slow down these trends. Instead, it seeks to find new, better ways to handle alerts – ways that fit the new reality.
Alert Correlation to the Rescue
What is Alert Correlation and how can it help? The best way to understand it is to look at an example.
Consider a MySQL cluster with 25 hosts. Some of these hosts have been experiencing high page-fault rates, and a few others have reported low free memory. In 30 minutes, you receive more than 20 individual alerts. Your Nagios dashboard now looks like a circus. Your email inbox looks even worse.
There is a right way to look at alerts. In this particular case, we would have preferred to see just a single incident. That incident would group together all of the cluster’s memory and page-fault alerts, allowing us to stay in control even during the alert flood.
Furthermore, by correlating these alerts together, we can easily distinguish between alerts belonging to this incident and other similar alerts, such as storage issues on the MySQL nodes or a global connectivity issue experienced by the datacenter. Alerts such as these often drown in the alert flood.
Alert correlation is a method of grouping highly-related alerts into one high-level incident. To do this, it addresses three main parameters:
- Topology – the host or hostgroup that emits the alerts
- Time – the time difference between the alerts
- Context – the check types of the alerts
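To make the three parameters concrete, here is a minimal Python sketch of one possible grouping rule. It is an illustration under stated assumptions, not BigPanda's algorithm: the `Alert` and `Incident` types, the `CONTEXT` mapping, the host names, and the 30-minute window are all hypothetical. Alerts land in the same incident when they share a hostgroup (topology), map to the same context, and arrive within the window of the incident's latest alert (time).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    hostgroup: str   # topology: the group that emitted the alert
    host: str
    check: str       # check type, e.g. "page_faults"
    timestamp: int   # seconds since the start of the example


@dataclass
class Incident:
    hostgroup: str
    context: str
    alerts: list = field(default_factory=list)


# Hypothetical mapping from check type to a broader context.
CONTEXT = {
    "page_faults": "memory",
    "free_memory": "memory",
    "disk_usage": "storage",
}


def correlate(alerts, window=1800):
    """Group alerts by hostgroup and context, splitting on gaps longer than `window` seconds."""
    incidents = []
    latest = {}  # (hostgroup, context) -> most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        context = CONTEXT.get(alert.check, alert.check)
        key = (alert.hostgroup, context)
        incident = latest.get(key)
        # Start a new incident if none exists or the last alert is too old.
        if incident is None or alert.timestamp - incident.alerts[-1].timestamp > window:
            incident = Incident(alert.hostgroup, context)
            incidents.append(incident)
            latest[key] = incident
        incident.alerts.append(alert)
    return incidents


# Hypothetical alerts from a MySQL hostgroup.
sample = [
    Alert("mysql", "db-03", "page_faults", 0),
    Alert("mysql", "db-07", "free_memory", 300),
    Alert("mysql", "db-03", "disk_usage", 600),
    Alert("mysql", "db-11", "page_faults", 900),
]

incidents = correlate(sample)
for incident in incidents:
    print(incident.hostgroup, incident.context, len(incident.alerts))
```

In this sample, the three memory-related alerts collapse into one incident, while the storage alert on the same hostgroup stays separate because its context differs.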
The Inevitability of Alert Correlation
Are there other ways to combat alert floods? One method commonly attempted by companies is alert filtering. Monitoring engineers define custom dashboards limited to a small set of alerts, designated as high-severity or sev-1 alerts. Such a dashboard is expected to be considerably less noisy than a full dashboard.
However, there are two major problems with alert filtering. First, it introduces a blind spot in your operational visibility. Low-severity alerts are often precursors to high-severity alerts: a CPU load issue might quickly evolve into a full outage. By ignoring the low-severity issues, you risk reacting to problems only after they are already impacting your production. Second, filtered dashboards become very noisy, very quickly. Looking at the MySQL example above, you would probably want to see all of the page-fault rate alerts in your high-severity dashboard. So even after eliminating the low memory alerts, you are still stuck with thirteen alerts in your new dashboard.
In contrast, with Alert Correlation you avoid alert floods without losing visibility. Once a company adopts alert correlation, it doesn’t need a high-severity dashboard anymore.
BigPanda ❤ Nagios
BigPanda is an alert correlation platform optimized for Nagios. It consumes your Nagios alerts in real time and uses an intelligent algorithm to process and correlate these alerts. The BigPanda dashboard is a cloud-based application that presents all of your Nagios alerts grouped together into high-level incidents.
Among the benefits of using BigPanda are:
- Efficiency – BigPanda’s algorithm is capable of reducing your alert load by up to 99%, while remaining highly accurate.
- Custom Rules – BigPanda allows you to configure custom rules for special correlation use-cases.
- Full Stack – In addition to Nagios, BigPanda can also consume alerts from other monitoring tools, such as New Relic, Splunk, Pingdom and many others.
Learn more and try it free at bigpanda.io.