When you buy a point solution to monitor your e-commerce site or to track application performance, you trap yourself in that monitoring silo. Accumulate enough of these tools and you end up with a bottom-up approach to monitoring that brings to mind the cautionary fable about several blind men and an elephant. Each blind man touches a different part of the elephant and is asked what the thing is. One man holding the tail believes it’s a rope. The man holding a leg believes it’s a tree trunk. The one touching an ear thinks it’s a hand fan.
In other words, when you focus too closely on minute details, you obscure the bigger picture, if not lose it entirely.
This approach is exactly backwards from what’s needed to ensure overall system health. IT professionals should decide what needs to be measured, and how, based on top-level business goals. After all, how can you reach your goals if they aren’t clearly defined? But in a sprawling monitoring environment, the tools too often decide for us.
Monitoring Sprawl Is Worse Than You Think
Unfortunately, many people, even those within IT, don’t realize just how out of control monitoring sprawl has become. Back at the dawn of the Internet era, monitoring was easy. The typical organization only had a few servers to monitor, and to do so, you simply turned to IBM, BMC, HP or one of the handful of other monitoring providers.
That was it. Job done.
Today, the landscape is much, much different. Today’s IT organization must cope with a hodge-podge of legacy servers, cloud-based apps, complex websites, e-commerce tools, mobile apps – the list goes on and on.
And while “app sprawl” was a key tech buzzword a few years ago during the height of the virtualization movement, people forgot that app sprawl would inevitably lead to monitoring sprawl.
Today’s complex monitoring landscape looks something like this:
Complexity isn’t necessarily the problem, though, or at least not the only problem. There are plenty of complex systems we rely on that work and work well – look at the immune system, for instance – but most of those tend to be adaptive systems. These systems have built-in intelligence that allows them to learn from experience and adjust to changing conditions on the fly.
In contrast, in our complex monitoring environments, too much is static, and too many alarms are delivered with zero context. Thus, when something breaks or even misfires, the event might trigger an alarm, or it might not. If the problem, bug, or vulnerability is new, it may well evade detection. And each alert that is triggered could be important . . . or it could be just noise.
In fact, constant streams of alerts are quickly becoming the IT equivalent of spam, and they are ruining IT’s ability to respond to real problems in real time.
What can we do to change this?
A Top-Down, KPI-Driven Alternative
One way to get a handle on sprawling monitoring tools is to determine a top-level goal for the entire system. Rather than focusing on errors, alerts, and anomalies, IT teams should first determine what exactly they are trying to do.
A brick-and-mortar retail store will have much different goals than Netflix, and those goals can be translated into KPIs (Key Performance Indicators). As you define KPIs to guide monitoring decisions, you may make some unexpected discoveries about where to focus your attention.
Netflix, for instance, learned that a single KPI trumped all others. Back in the early days of Netflix streaming (around 2008), the company attempted to gather metrics on everything.
But manually tracking hundreds of metrics for thousands of servers that were streaming video to millions of end-user devices just did not scale. That led Netflix to shift its focus back to its top business goal: delivering a superior viewing experience. With that goal in mind, the company asked how it could best measure that experience. After digging through some options, Netflix soon realized that one KPI indicated the state of the viewing experience better than all others: the simple act of clicking play.
Netflix found that there were normal viewing patterns for each day, time of day, region, etc. If viewers were clicking play too often, something was wrong. If they weren’t clicking play often enough, something was probably wrong then too. Netflix learned that if they focused on “starts per second” (SPS), they could quickly tell whether or not something was amiss. SPS is the KPI Netflix leans on.
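To make the idea concrete, here is a minimal sketch of checking a KPI against its normal pattern for a given time slot. It is not Netflix's implementation. The function names (`build_baseline`, `check_sps`), the hour-of-week bucketing, and the three-standard-deviation tolerance are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Group historical samples by time slot and summarize each slot.

    history: iterable of (hour_of_week, sps) pairs (hypothetical format).
    Returns {hour_of_week: (mean, standard deviation)} for every slot
    that has at least two samples.
    """
    buckets = {}
    for hour, sps in history:
        buckets.setdefault(hour, []).append(sps)
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def check_sps(baseline, hour, observed, tolerance=3.0):
    """Compare an observed value with the normal range for its time slot.

    Returns "high" or "low" when the value falls outside
    mean +/- tolerance * stdev, "ok" when it is inside that band,
    and "unknown" when there is no baseline for the slot.
    """
    if hour not in baseline:
        return "unknown"
    avg, sd = baseline[hour]
    band = tolerance * sd
    if observed > avg + band:
        return "high"  # too many plays: possibly retries after failures
    if observed < avg - band:
        return "low"   # too few plays: viewers may be unable to start
    return "ok"


if __name__ == "__main__":
    history = [(20, 100.0), (20, 110.0), (20, 90.0), (3, 10.0), (3, 12.0)]
    baseline = build_baseline(history)
    print(check_sps(baseline, 20, 60.0))
```

The point of the sketch is that deviations in both directions are treated as signals, which matches the observation that too many plays can be as telling as too few.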
Here’s how Netflix explains the importance of SPS:
We have streamlined production operations and improved availability by creating a single directional metric that indicates service health: SPS. We have experimented with and used a number of techniques to derive additional insight from this metric including threshold-based alerting, exponential and double exponential smoothing, and bayesian and stream mining approaches. SPS is the pulse of Netflix streaming, focusing the minds at Netflix on ensuring streaming is working when you want it to be.
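One of the techniques named in that passage, double exponential smoothing, can be sketched in a few lines. The version below uses Holt's linear method to produce a one-step-ahead forecast and flags points that stray too far from it. The smoothing weights, the 20 percent error limit, and the function names are assumptions for illustration only, and nothing here reflects how Netflix actually tunes or applies the method.

```python
def holt_forecasts(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from double exponential smoothing.

    forecasts[i] is the prediction for series[i] made before seeing it.
    The first two entries are None because two points are needed to
    initialize the level and the trend.
    """
    if len(series) < 2:
        raise ValueError("need at least two samples")
    level = series[1]
    trend = series[1] - series[0]
    forecasts = [None, None]
    for value in series[2:]:
        forecasts.append(level + trend)
        prev_level = level
        # Blend the new observation with the previous projection.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts


def flag_deviations(series, max_error=0.2, alpha=0.5, beta=0.3):
    """Return indexes where the actual value misses the forecast by more
    than max_error, measured relative to the forecast."""
    flagged = []
    forecasts = holt_forecasts(series, alpha, beta)
    for i, (actual, predicted) in enumerate(zip(series, forecasts)):
        if predicted is None or predicted == 0:
            continue
        if abs(actual - predicted) / abs(predicted) > max_error:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    sps = [10, 12, 14, 16, 5]  # made-up samples with a sudden drop
    print(flag_deviations(sps))
```

Compared with a fixed threshold, a forecast-based check follows the metric's trend, so a steady evening ramp-up does not trip the alarm while a sudden drop does.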
Another advantage of a KPI like SPS is that it is easy to communicate across the organization (especially when you contrast SPS with load or some other low-level metric). By defining a top-level KPI like SPS, not only are you focusing on the right thing, but you also create a shared vocabulary that works for all stakeholders, from the most junior NOC engineer all the way up to the CEO.
What is the SPS equivalent for your business?
Figure that out, and you’ll go a long way towards taming monitoring sprawl.
Remember, monitoring is first and foremost about your business. It’s about ensuring that your IT infrastructure aligns with business goals. To stay focused on your top-level business objectives, start with the big picture and work down from there. Otherwise, you may think you’re grabbing a deep-rooted tree trunk, when what you really have on your hands is an elephant that’s about ready to stampede.