IT environments are increasingly hybrid, complex and constantly changing. IT Ops teams are constantly scaling their tools, people, and processes to address this. In the process, they are creating a lot of moving parts that need to work together well, yet are evolving at different speeds. This gap in maturity is often overlooked, as teams focus on operational KPIs, but it is probably the factor which exacerbates operational challenges the most.
In this CTO Perspective, we’ll discuss with Jason Walker, Chief Customer Officer at BigPanda, why IT Ops teams should prioritize maintaining a common maturity across all their operations, and how best to do that.
Read the skinny for a brief summary, then either lean back and watch the interview, or if you prefer to continue reading, take a few minutes to read the transcript. It’s been lightly edited to make it easy for you to consume it. Enjoy!
IT Operations teams operate in three main areas: monitoring, incident management, and awareness (how much they know about their services – their topology, changes in their environments, environment diagnostics). If an IT organization advances in any one of these three areas quicker than it does in the other two, its operations suffer. For example, an IT Ops team may have good monitoring sources that create millions of alerts, and tools with the capacity to ingest all these alerts, but lack the capability to enrich the ingested data. This leads them to create thousands of non-actionable incidents, which in turn create a huge workload that can cripple them.
In essence, ITOps is as strong as its least mature component. To improve a poorly performing incident pipeline, it is therefore often not enough to focus on operational KPIs. Teams need to take a step back, assess the maturity of the different parts of their organization, and work to mature their weaker spots. How should they do that? Watch the interview to find out.
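As a rough illustration of the "least mature component" idea, the sketch below scores each of the three areas on the five stages named in the interview and reports the weakest one. The numeric ordering, the `Stage` enum, the function names, and the sample assessment are all assumptions made for illustration; they are not part of BigPanda's model or product.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Maturity stages named in the interview; the numeric values are an assumed ordering."""
    REACTIVE = 1
    RESPONSIVE = 2
    PROACTIVE = 3
    SEMI_AUTONOMOUS = 4
    AUTONOMOUS = 5


def weakest_pillar(assessment: dict[str, Stage]) -> str:
    """Return the pillar with the lowest stage (the first one wins on ties)."""
    return min(assessment, key=assessment.get)


def maturity_gap(assessment: dict[str, Stage]) -> int:
    """Number of stages between the most and least mature pillars."""
    return max(assessment.values()) - min(assessment.values())


# Hypothetical self-assessment: decent monitoring, fully manual incident management.
assessment = {
    "monitoring": Stage.PROACTIVE,
    "incident_management": Stage.REACTIVE,
    "awareness": Stage.RESPONSIVE,
}

print(weakest_pillar(assessment))  # incident_management
print(maturity_gap(assessment))    # 2
```

A large gap suggests that one area has run ahead of the others, so the weakest pillar is the likely place to invest first.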
Yoram: Hello and welcome to the CTO perspective, where we discuss unique perspectives about the most current issues in IT operations. Today we’ll be talking about a very interesting topic called the ITOps maturity model. And we’ll be doing that with Jason Walker, Chief Customer Officer at BigPanda. Hey, Jason.
Jason: Hey Yoram, good to see you again.
Yoram: It’s great seeing you, always very eye-opening talking to you. So, the ITOps maturity model. Let’s dive into that. IT environments are increasingly hybrid, complex, very dynamic and constantly changing. ITOps teams are constantly scaling their teams, people, processes and their tools to address this. And in the process, they’re creating a lot of moving parts that need to work together well. The way that they measure how well they are working together is by using operational KPIs, such as MTTX, etc. But it seems that they’re missing something. And that is the fact that as they’re developing and scaling the different parts of their IT operations, each part is maturing at a different rate. And that’s creating an issue, isn’t it?
Jason: It is, because to do ITOps well, you must be good at multiple things. And really, I would divide these things into three major areas. First is monitoring. Second is incident management, and the processes you use around that. And the third is something I’ll call awareness. I think everybody knows what monitoring is: the amount of coverage and the quality of that coverage, as well as your event processing capability. Incident management has been well covered in other places and organizations know how to mature there. The third one, often less addressed, is awareness: how much do you know about your services – mainly your topological awareness, your change awareness, and your diagnostic awareness.
And if you advance along any one of these three areas quicker than you are moving in the other two, and get ahead of yourself so to speak, it will cause problems for your ITOps.
Yoram: And why is that? Why is it causing problems when you’re not on the same level of maturity in each of these three pillars?
Jason: Think about a weightlifter who has to do a full clean and press: if he only develops his legs and tries to lift something heavy and his upper body can’t take it, he’s going to fold in half. In the ITOps world, you have a ton of good monitoring sources and you’re ingesting millions and millions of events telling you all sorts of things. If you lack the capability to deal with that quantity, or automate the incident management flow, or to enrich that data – then you’re going to create tons of non-actionable incidents, which then create a huge workload. And so, you’ll start stumbling over yourself, all because you’re not in pace with all the different areas that you need to be good at.
Yoram: So you’re as fast as your slowest part in the pipeline. The slowest part, the most immature part, is weighing you down.
Jason: Absolutely. If you have a very manual incident management process where nothing is automated, data has to be passed manually from, let’s say, an alert payload into a ticketing system – you are going to get bogged down with that very administrative function. And then all your assignments, all your escalations, maybe some of your automated context gathering, are going to fall over – if you can’t keep pace with that upstream ingestion of monitoring events.
Yoram: You developed a maturity model for these three pillars, with several stages for each one.
Jason: Yes, we used a standard maturity model and broke it up for ITOps in each of those major areas.
You’re either reactive, responsive, or proactive, which is where things start to get better, and then semi-predictive, semi-autonomous, which is really the peak maturity we see out there for most organizations today. And then there is the Nirvana state of autonomous operations. And most organizations are aspiring to that. Or maybe they’ve hit it with one or two services.
Yoram: So, five stages from zero to the Nirvana, as you said. How do you assess which state you’re in, in each of the three pillars?
Jason: For each of the three pillars, we have a graphic that describes where you are in each of the phases of maturity. And then we point to a few specific metrics. For event processing, for instance, the key metric is incident volume. It’s not MTTX. It’s how many incidents are you running relative to your scale as an organization.
Yoram: And for the other two pillars, just so we get a feel for it?
Jason: Sure! For incident management processes, incident actionability is one metric that you look at. MTTX is another, of course, everybody looks at that. And then another one, especially when you get to the higher phases with incident management, is average incident priority. Because as you get more predictive, that metric should be declining. You should have a higher ratio of low priority issues. And then on awareness, the key metric is the data volume that you’re ingesting as part of your operations pipeline.
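The metrics Jason lists can be sketched in a few lines. This is only a minimal example under stated assumptions: the `Incident` fields, the 1-to-4 severity scale (higher means more urgent), and the use of service count as a stand-in for organizational scale are hypothetical and are not definitions from the interview.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    actionable: bool  # assumed flag: did someone have to act on this incident?
    severity: int     # assumed scale: 1 = low priority, 4 = critical


def actionability(incidents: list[Incident]) -> float:
    """Share of incidents that were actionable, from 0.0 to 1.0."""
    if not incidents:
        return 0.0
    return sum(i.actionable for i in incidents) / len(incidents)


def average_severity(incidents: list[Incident]) -> float:
    """Mean severity; on this assumed scale it should fall as operations get more predictive."""
    if not incidents:
        return 0.0
    return sum(i.severity for i in incidents) / len(incidents)


def incidents_per_service(incidents: list[Incident], service_count: int) -> float:
    """Incident volume relative to scale, using service count as an assumed proxy."""
    return len(incidents) / service_count


# Hypothetical sample data.
sample = [
    Incident(actionable=True, severity=4),
    Incident(actionable=False, severity=1),
    Incident(actionable=True, severity=2),
    Incident(actionable=False, severity=1),
]

print(actionability(sample))             # 0.5
print(average_severity(sample))          # 2.0
print(incidents_per_service(sample, 8))  # 0.5
```

Tracking these values over time, rather than as one-off numbers, is what would show whether a pillar is actually maturing.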
Yoram: Once you’ve established what stage you’re in for each of the three areas, what would be the next step? What do you need to do?
Jason: You want to look for your outliers.
Where you are ahead the most – stay there. And then that draws your attention to where you are behind the most.
If, for example, your incident management process is completely manual, or undocumented, or inconsistent, then let’s work across people, process and technology to bring that up one or two notches. Gradually you get more automated. Gradually you get much more proactive. And so on. If you have no topological awareness or very little of it, go out and collect that information from all the different owners of that information, and from the tools that they are running to collect it.
Yoram: So what you’re saying is a bit different, I guess, than the instinctive way of looking at IT operations. You’ve got an issue. You have a lot of downtime or you’re having issues with service availability and the quality of service. What you’re saying is: take a step back and see how mature you are in each of the three areas. Don’t look at the end result. Rather, take a look at how things are working together within your organization and put your prioritization in the place where you are least mature. You need to grow up first, in order to be able to deal with what you’re doing. Leave the operational KPIs aside for a second. Deal with the gaps first.
Jason: Yes, because those gaps will hold you back significantly.
If you have a choke point in your operations pipeline where everything slows down, then you have to deal with that choke point and open it up, and make it function well before you can move any more through that pipeline.
Yoram: When customers hear about this maturity model, is this something they connect to? Is it difficult to get them to adopt it?
Jason: The times we’ve brought it in front of customers, or different organizations, they’ve immediately looked at it and said: Oh, that makes sense.
Or – yes, I am actually a little embarrassed to say, I’m way back here with my automation around incident management processes. That’s all manual right now. And we’re just fighting that with bodies. When asked: hey, how’s your monitoring coverage? they often answer: well, about 30 percent of our incidents are detected by monitoring. Now that’s a little low. So, we point out that the other 70 percent is what they’ve optimized around and tell them to look at improving that 30 percent before doing anything else. And that’s a very pragmatic, sensible approach to make organizations function better as a whole rather than buying off on any specific area as the solution to everything.
Yoram: And how does BigPanda fit into this maturity model? What can it offer to customers?
Jason: I think people succeed and fail with tools to different degrees, based on how they use them. BigPanda can improve visibility immensely for an operations pipeline. BigPanda puts analytics around the entire pipeline so that you can look at it as a whole, and it provides tools to improve things where you might have gaps. Across enrichment, correlation, automation, the big three – BigPanda gives you a single pane of glass to do all that if you don’t already have one. And tying those together and then measuring the results. Very powerful for any organization.
Yoram: So BigPanda creates visibility, helps you understand where you are in each of the stages in the maturity model, and then also helps you solve the gaps in those places.
Jason: Yes, because you’ll pull those metrics out immediately and know where you are in a certain area. You’ll quickly pick up on weak points in your overall pipeline. Overall, you will start looking at it as a pipeline rather than as an assortment, a collection of different tools that you’ve just glued together. Now, it is one continuous pipeline that delivers either efficient or inefficient ITOps, and good or bad service availability. And you can start to address those right away.
Yoram: By also using BigPanda tools.
Jason: Oh, absolutely. It’s hugely powerful once you start using it for that. But remember, it’s always dependent on good inputs, just like everything else.
Yoram: For sure! I think we’re at time – so this is a great place to end. This has been very enlightening. Thank you so much for your time.
Jason: Thank you. It was good talking to you again.
Yoram: Great talking to you, I’m sure we’ll do it again soon. And if you want to hear more CTO perspectives or learn about IT operations and AIOps, please visit us at BigPanda.io.