Most IT and Ops professionals have a love-hate relationship with Nagios. On the one hand, Nagios gives you near real-time visibility into the inner workings of your IT infrastructure. With Nagios, you can tell which host is low on memory, which services are consuming too many CPU cycles, and which applications are taking too long to grant access. You also get alerts early enough that you can solve problems before they impact end users.
That’s the theory, anyway. In practice, Nagios can end up causing as many problems as it solves.
Okay, let me step back a moment. Nagios doesn’t actually cause problems, but it can make it harder for Ops teams to identify real problems. For you parents out there, here’s a way to think about it. 99% of the time, when our children cry, there’s nothing really wrong. They want attention, or, because of their limited experience, an insignificant event seems like a big enough deal to cry over.
As parents, we know that a scraped knee needs only acknowledgement and a Band-Aid, and then the tears will dry. But in that moment of panic, your child may think she’ll never walk again. Or she may just want attention. Either way, parents know almost instantly whether they have a real problem on their hands or not.
Dealing with Nagios is much like dealing with a crying child, except that there is no easy way to distinguish a scraped knee from a broken leg. With Nagios (and most monitoring systems, in fact), each alert could be the sign of a major problem on the horizon, or just a minor issue. We just can’t tell. We have no way to differentiate crocodile tears from real, painful anguish.
We have no way to tell whether that alert about low memory may be signaling an impending service outage or is just a blip on the radar, an anomaly that will go away on its own.
The Needle in the Haystack Problem
Here’s where we need to ask two important questions:
Why do alert floods happen at all? And why are things only getting worse every year?
The root of the problem actually lies in a positive aspect of alerting: automation. There is no way any single Ops pro, or even an entire Ops team, can manually parse hundreds of thousands of metrics to pinpoint unhealthy values. Nobody expects their Ops teams to stare at charts and point out things such as “CPU load on host #472 is awfully high” or “access time to our billing app from France appears to be off.”
Instead, we configure thresholds and delegate this work to Nagios. Then, Nagios goes through all of our checks and looks for metrics that exceed thresholds, alerting us when necessary.
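To make the threshold idea concrete, here is a minimal Python sketch of the kind of comparison a check performs. It is not Nagios code: the `Threshold` class, the `evaluate` and `run_checks` helpers, the metric name, and the sample values are all hypothetical. The only thing borrowed from Nagios is the plugin convention that states 0, 1, and 2 mean OK, WARNING, and CRITICAL.

```python
from dataclasses import dataclass

# Nagios plugin convention: return codes 0, 1, 2 map to these states.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold pair: a value at or above a level triggers it."""
    warning: float
    critical: float


def evaluate(value: float, threshold: Threshold) -> int:
    """Return the state for a single metric sample."""
    if value >= threshold.critical:
        return CRITICAL
    if value >= threshold.warning:
        return WARNING
    return OK


def run_checks(samples, thresholds):
    """Yield (host, metric, state name) for every sample that is not OK."""
    for (host, metric), value in samples.items():
        state = evaluate(value, thresholds[metric])
        if state != OK:
            yield host, metric, STATE_NAMES[state]


# Illustrative data only; real values would come from your checks.
thresholds = {"cpu_load": Threshold(warning=4.0, critical=8.0)}
samples = {("host-472", "cpu_load"): 9.5, ("host-473", "cpu_load"): 1.2}
alerts = list(run_checks(samples, thresholds))
```

Notice that every sample over a threshold produces its own alert. Nothing in this logic knows whether two alerts share a cause.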
So far, so good.
How, then, does such a big time saver transform itself into a productivity killer at precisely the worst time – during major outages?
The answer is simpler than you might expect: modern applications are designed in a way that buries us under a flood of noisy alerts. It’s how they’re built.
Here are the three main culprits:
1. Servers. Modern applications have moved off in-house, behind-the-firewall servers and into the cloud. Today, we rely on distributed computing. We load-balance services, shard databases, and cluster queues. As a result, a single application is no longer tied to a single powerful server. Instead, it is supported by tens, sometimes even hundreds or thousands, of instances. When the application or service is experiencing an issue, the problem manifests itself as issues on many of the supporting instances.
What we get is hundreds of alerts, but they all point to just one issue, even though each one seems like a separate problem.
2. Microservices. In the past decade, the idea of monolithic software was retired in favor of a service-oriented architecture. A single business application now consists of many services communicating with one another, and this trend will continue to intensify over time as more and more developers create open APIs. The shift has allowed companies to move much faster by improving scalability, cohesion, and maintainability.
Unfortunately, this also means that a single problem is likely to impact many services. An issue in one service can escalate to adjacent services, gradually creating a cascade of alerts from dozens of services.
Yet, which service is the root cause? In a massive alert stream, it can be impossible to tell.
3. Agile Development. Today, engineering teams must align their goals with top-level business goals. This shift means that we now see fewer and fewer year-long, technically rich projects developed in isolation. Developers opt for shorter iterations, guided by ongoing customer feedback. Unfortunately, this impacts our ability to maintain accurate and up-to-date monitoring configurations. By the time we are done configuring thresholds and hierarchies, our application has already changed. Over time we accumulate large quantities of meaningless checks or checks with dated thresholds.
How, though, do you distinguish legacy noise that should be ignored – or even suppressed – from pressing problems that could lead to downtime or outages?
Of course, the evolution away from monolithic server-based applications is a positive development, but, unfortunately, our monitoring and alerting are still stuck in the old behind-the-firewall paradigm.
Clearly, it’s time to modernize.
Using Data Science to Tame Nagios
This is where BigPanda comes in. BigPanda intelligently clusters and correlates your Nagios alerts into high-level incidents, so you can spot critical issues faster. Rather than wading through thousands of Nagios alerts, pages, and emails by hand, you can now view them in a unified way, clearly separating the signal from the noise.
BigPanda’s aggregation engine normalizes alerts from Nagios and all of your other monitoring systems into a single unified data model. Then, BigPanda uses cluster analysis and statistical learning to analyze your alert stream in real time and group related alerts into high-level incidents that are much easier to detect and understand.
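To give a feel for what normalizing and grouping mean, here is a deliberately simple Python sketch. It is not BigPanda's algorithm: in place of cluster analysis and statistical learning, it swaps in a plain heuristic that groups alerts sharing a service and check when they arrive close together in time. The `Alert` fields, the raw field names (`host`, `service`, `check`, `time`), and the 300-second window are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical unified alert model shared by all monitoring sources."""
    source: str
    host: str
    service: str
    check: str
    timestamp: float  # seconds


def normalize(source, raw):
    """Map one raw alert dict onto the unified model (field names assumed)."""
    return Alert(
        source=source,
        host=raw["host"],
        service=raw.get("service", "unknown"),
        check=raw["check"],
        timestamp=float(raw["time"]),
    )


def group_into_incidents(alerts, window=300.0):
    """Group alerts with the same service and check into incidents.

    An alert joins the current incident if it arrives within `window`
    seconds of the previous alert in that incident; otherwise it starts
    a new one.
    """
    by_key = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_key[(alert.service, alert.check)].append(alert)

    incidents = []
    for group in by_key.values():
        current = [group[0]]
        for alert in group[1:]:
            if alert.timestamp - current[-1].timestamp <= window:
                current.append(alert)
            else:
                incidents.append(current)
                current = [alert]
        incidents.append(current)
    return incidents


# Illustrative data: 200 web hosts alert within a few minutes, then one
# unrelated alert on the same check arrives much later.
raw_alerts = [
    {"host": f"web-{i}", "service": "billing", "check": "http_latency", "time": 1000 + i}
    for i in range(200)
] + [{"host": "db-1", "service": "billing", "check": "http_latency", "time": 5000}]

alerts = [normalize("nagios", raw) for raw in raw_alerts]
incidents = group_into_incidents(alerts)
```

With this sample data, 201 alerts collapse into two incidents. A fixed key and time window is far cruder than real correlation, since it cannot link alerts across different services, but it shows why a grouped view is easier to read than a raw stream.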
BigPanda can often reach 99% correlation between alerts with very high levels of accuracy. This is what Ops teams need – intelligent alert correlation – to keep up with today’s sprawling, out-of-control IT infrastructures.
Anything less will leave you exposed to threats, misconfigurations, downtime, and outages because the real problems get buried under the flood of noisy alerts.
Stay tuned for Part 2, where I’ll discuss in more depth how you can add intelligence to your Nagios alerts through alert correlation.