Anyone with a corporate credit card can deploy a SaaS-based tool in minutes, often even connecting it through APIs to other critical apps. Adding to the chaos, developers have numerous automated shortcuts at their disposal (such as Serena for app deployment and Testroid for mobile app testing) that dramatically compress the time it takes to roll out a new app. Add this all up, and application and IT infrastructure is sprawling out of control.
The trouble is that the old way of managing applications and infrastructure is woefully out of date.
The old IT monitoring tools from the likes of IBM, BMC, and CA were designed for a different era. They were architected to handle dotcom-era infrastructure problems, and they just haven’t kept up with the scale and complexity of modern infrastructure or the accelerated pace of application development cycles. As a result, companies are using a variety of specialized monitoring tools, like Nagios, New Relic, Pingdom, and Splunk, to get deep visibility into the various layers of their stack. Unfortunately, those tools don’t work together in a coherent fashion.
In the past, when struggling with chaotic change, IT organizations and Ops teams used to turn to the ITIL Service Lifecycle model for guidance. ITIL originally emerged in the late 1980s and early 1990s to help organizations follow best practices and better align IT to business goals. However, it was developed in a quieter, less frenetic time, and even the latest iteration, ITIL 2011 edition, hasn’t quite adjusted to this new reality.
ITIL is still a solid framework, offering guidance on such important processes as IT service design, operations, and improvements, but Ops teams need more. Ops teams need modern methods to tackle modern problems. Otherwise, the alerts from various monitoring tools will keep coming in such large volumes that it will be nearly impossible to separate the signal from the noise.
The way forward – how to pinpoint signal in the noise
What enterprise Ops teams need today is a new approach. I’m not saying we should throw out ITIL. We should not, but we need to augment it with tools and techniques designed for today’s decentralized, rapidly changing infrastructures.
It’s no longer enough to simply get alerts. Ops teams get too many of them already. In fact, the steady stream of alerts is training Ops to ignore them. When there is so much noise, you will tend to mistake the infrequent signal for yet more noise. That’s not a good thing.
Instead, Ops teams need holistic solutions with built-in intelligence. They need tools that automate the mundane, mind-numbing, error-prone tasks, tasks that have grown so numerous that they’re scaling beyond what is humanly possible to keep up with. At the same time, Ops also needs solutions that offer enough actionable intelligence that when you get an actionable alert (signal), you know what to do with it.
New incident-response management solutions are emerging to do just that, but vendors take very different approaches and often leave out key features. I believe that what Ops teams need are solutions that take into account five key factors that legacy monitoring tools don’t cover:
1. Time. Ops teams need to know what is happening in real time. Rather than getting snapshots that could be outdated by the time you see them, Ops needs to be able to easily see exactly what is happening now. One thing snapshots did well, however, was allow you to compare today with yesterday and a week ago. They gave historical context, and as you move to real-time intelligence, it’s important to maintain the context that gives your insights meaning. You need to be able to understand the larger significance of an event, determining whether or not it’s been building over time and gauging whether this is an isolated anomaly, a major risk, or something else entirely. You also need to be able to understand all of that at lightning speed. We call this “time-to-insight,” and if time-to-insight isn’t fast, you’ll always be at least one step behind.
2. Level of importance. If you can’t prioritize which alarms are the most critical ones, you have no priorities. Ops needs guidance from the tools serving it. Ops needs to understand just how critical it is to address an alert right now, and what the risks are if they don’t.
3. Alert patterns. As you scrutinize alerts, can you see the forest for the trees? Do separate alerts actually have similar underlying causes? Does a cluster of alerts indicate a single, underlying problem? Are alerts that seem isolated at a glance actually signals of a much larger problem? Do two alerts that look similar at first glance actually signal very different problems? Without insight into what each alert means, both separately and also holistically, it can be impossible to know.
4. Automation and integrations. Repetitive manual tasks tend to introduce errors. Those tasks must be automated, and automated tasks must be integrated into the various tools serving Ops. We also need to start automating tasks that just don’t scale. Applying patches, managing configurations, and plugging vulnerabilities take too long in our sprawling application environments to be handled manually. There just aren’t enough hours in the day. The only way forward is automation.
5. Insight into the big picture. There are no islands unto themselves in today’s IT infrastructures. An alert from one app could signal emerging problems in several others. Ops needs to understand how each piece of the IT puzzle fits together and affects every other piece. In today’s rapidly changing IT environments, this is much easier said than done, but that’s not an excuse. Big-picture insights are crucial.
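To make factors 2 and 3 a little more concrete, here is a minimal sketch of how alerts from several monitoring tools might be clustered and ranked. It is an illustration only, not a description of any vendor's product: the `Alert` fields, the 1-to-5 severity scale, and the five-minute window are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; real tools expose different schemas.
    tool: str         # which monitoring tool raised the alert
    service: str      # the affected service
    severity: int     # assumed scale: 1 = informational, 5 = critical
    timestamp: float  # seconds since some common reference point


def cluster_alerts(alerts, window_seconds=300):
    """Group alerts for the same service that arrive close together in time.

    An alert joins the current cluster when it arrives within
    window_seconds of the previous alert for that service.
    """
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert.service].append(alert)

    clusters = []
    for service_alerts in by_service.values():
        service_alerts.sort(key=lambda a: a.timestamp)
        current = [service_alerts[0]]
        for alert in service_alerts[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                clusters.append(current)
                current = [alert]
        clusters.append(current)
    return clusters


def prioritize(clusters):
    """Rank clusters by their worst severity, then by how many alerts they hold."""
    return sorted(
        clusters,
        key=lambda c: (max(a.severity for a in c), len(c)),
        reverse=True,
    )


if __name__ == "__main__":
    alerts = [
        Alert("Nagios", "checkout", 3, 0),
        Alert("New Relic", "checkout", 5, 120),
        Alert("Pingdom", "checkout", 4, 200),
        Alert("Nagios", "search", 2, 50),
    ]
    for cluster in prioritize(cluster_alerts(alerts)):
        tools = ", ".join(a.tool for a in cluster)
        print(cluster[0].service, "->", len(cluster), "alert(s) from", tools)
```

Grouping by service and time is only one crude heuristic. A real system would probably also need topology or dependency data to tell a single underlying problem apart from several that merely coincide.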
Add the above five factors up, and you could boil this all down to two key ingredients that will boost Ops’ chances of maintaining control: time-to-insight and time-to-remediation.
Time-to-insight is crucial because if it takes too long to figure out what an alert means, your users may already be flooding your help desk with calls. The longer the process takes, the more it costs your organization in downtime and lost productivity.
Once Ops has the right insight, though, then what? Knowledge is just one part of the equation. Now, you must act. Are you confident enough in the insight to act? Do you know how to proceed in order to solve the problem? Time-to-remediation is overlooked in legacy tools, but, remember, if we can’t measure it, we can’t improve it, and constant improvement is the only way Ops will keep pace with change.
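As a rough illustration of what measuring these two numbers could look like, the sketch below derives both from three timestamps per incident. The `Incident` record and its field names are hypothetical, and the definitions are an assumption made for the example: time-to-insight runs from the alert to the moment the problem is understood, and time-to-remediation runs from that moment to resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; field names are assumptions for this sketch.
    alerted_at: datetime     # when the first alert fired
    understood_at: datetime  # when Ops knew what the alert meant
    resolved_at: datetime    # when the fix was confirmed

    @property
    def time_to_insight(self) -> timedelta:
        return self.understood_at - self.alerted_at

    @property
    def time_to_remediation(self) -> timedelta:
        return self.resolved_at - self.understood_at


def median_minutes(durations):
    """Median of a collection of timedeltas, expressed in minutes."""
    return median(d.total_seconds() / 60 for d in durations)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    incidents = [
        Incident(start, start + timedelta(minutes=10), start + timedelta(minutes=40)),
        Incident(start, start + timedelta(minutes=20), start + timedelta(minutes=25)),
        Incident(start, start + timedelta(minutes=60), start + timedelta(minutes=150)),
    ]
    print("median time-to-insight (min):",
          median_minutes(i.time_to_insight for i in incidents))
    print("median time-to-remediation (min):",
          median_minutes(i.time_to_remediation for i in incidents))
```

The median is used here rather than the mean so that one unusually long outage does not dominate the trend. Tracking both figures over time is what would show whether things are actually improving.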