AIOps as a modern cockpit, and why that matters

Date: June 9, 2021

Author: BigPanda

Our human capacity for ingesting information and acting on it is constant. As the systems we operate grow more complex, we need to make sure we use technology that presents us with only the relevant information we need, exactly when we need it. In aviation, this lesson was learned long ago, and now IT Ops is catching up.

Join us in a CTO Perspective discussion with Jason Walker, Chief Customer Officer at BigPanda and former Marine pilot, to find out exactly how IT Ops is following in the footsteps of the modern cockpit.

Read the skinny for a brief summary, then either lean back and watch the interview or, if you prefer to keep reading, take a few minutes to read the transcript. It’s been lightly edited for readability. Enjoy!

 


The Skinny

The fly-by-wire revolution of the late 1980s had many similarities to what is happening today with the adoption of AIOps in IT Ops. Back then it was understood that humans could no longer physically process and properly react to the gigabytes of information thrown at them by flying machines that were growing more and more complex. Computers were needed to do most of the heavy lifting, while providing pilots with only the data and information they needed to safely fly the aircraft to its destination.

In the same way, running monitoring and incident management across today’s hybrid enterprise services involves operating multiple complex systems functioning simultaneously. AIOps is needed to make sense of all the data they create, automate most of the work, and provide operators with just the specific information they need to make critical decisions. AIOps assists the human OODA (Observe, Orient, Decide, Act) loop, and if we understand its purpose in that context, we can better plan and execute AIOps adoption.

Read on or watch the interview to learn exactly how.

 


The Interview

 


The Transcript

Yoram: Hello, and welcome to the CTO Perspective, where we discuss unique perspectives about the most current issues in IT operations. And today we’ll be talking, once again, with Jason Walker – Field CTO at BigPanda. Hey, Jason.

Jason: Hey, Yoram. Good to see you.

Yoram: Great seeing you again! Today, we’re going to be venturing a little off the main road, or maybe above the main road. We’re going to try to better understand the adoption of AIOps by comparing it to the surprisingly similar process that happened in aviation around three or four decades ago. And that’s the adoption of fly-by-wire: computer-assisted flight. And who better to talk about this comparison than two, I would say, IT Ops enthusiasts who are also former pilots, right? What are the chances?

Jason: It is rare to meet another helicopter pilot. You flew the Israeli version of the CH-53, and I flew the U.S. Marine version. Both older aircraft, and my version was about halfway through that fly-by-wire revolution.

Yoram: So, let’s take that perspective and talk about the change that we discussed: fly-by-wire. Somewhere in the mid to late 80s, aircraft manufacturers understood that planes that fly faster and safer, and take more people greater distances, need to rely on computers to help them fly. The amount of information that was being thrown at the pilots by these aircraft, by their systems, was overwhelming. There were simply too many gauges.

Jason: Absolutely. And it was beyond the capability of a pilot to keep up with all those caution lights, all those different gauges, all those different sensors in mission systems. The amount of information coming in required a shift in cockpit design.

Yoram: So, what was the change? What was going on behind the scenes?

Jason: Behind the scenes, aviation engineers were working on a philosophy of less is more.

And I think that is a key point for IT operations to take into consideration. They realized that pilots did not need all the information that the systems could give, all of the sensor information that’s out there, but just the subset, and the contextualized subset, that was relevant at a specific time. If you think about something like an altitude indicator, what it could do is provide very accurate, second-to-second updates on what altitude you’re at. But you don’t need that when you’re at 50,000 feet. You need it when you’re at 200 feet and descending at 2,000 feet per minute.

And that was manifested in what’s commonly known as “Bitchin’ Betty”, which were the “Altitude! Altitude! Pull up! Pull up!” warnings that every pilot knows only too well. These were delivered directly into your headset, and you could hear the aircraft telling you what you needed to do at the precise moment that you needed to do it. And it was very clear and concise, and much better than any investment in more granularity around altitude information.

Now, if you expand that out to the entire suite of information that is available in a modern aircraft, it really brings the point home: altitude is just one tiny subset. It’s not even tactical information. It is just flying the aircraft, a very basic function nowadays. And so, yes, they applied that same philosophy to almost everything else in the cockpit, and you saw the cockpit shrink down and the field of view expand. You could actually see outside better. Much better in a modern helicopter for sure.

Yoram: So, the more complex the machine, the more complex the mission, the more complex the systems that you must manage – the less complex the display should be. The less complex the amount of information that you need to get. You need to get only the relevant information, and at the specific time that you need it, and let the systems do all the rest for you. That’s the way to create situational awareness.

Jason: Absolutely. Because the one thing that isn’t changing is the human factor. You’ve got your eyes, your ears, and your hands, and that’s what you’re using to interpret the world around you. You do that OODA loop: Observe, Orient, Decide, Act as fast as you can.

But humans are not evolving as fast as the aircraft technology and the mission technology. These are getting rapidly more complex and doing more and more things. We can process that information only at the speed of the human mind, which is constant.

Yoram: The OODA loop you just mentioned. For people who don’t know what that is, can you quickly explain?

Jason: The OODA loop: Observe, Orient, Decide, Act. It’s basically how humans make decisions. You see what’s going on. You orient that into your contextual awareness of everything you’ve previously experienced and what you know about the current situation. Then you decide what to do, and then you execute on that decision.

Yoram: So, the OODA loop is constant in the human being. The human capability to do something is constant, and as the systems grow more complex you have to make sure the technology that you’re using simplifies things to adapt to the human OODA loop. If we take that to the IT Ops world, adopting AIOps is actually the modern cockpit for NOCs and IT Ops teams. You need to do those same processes in IT operations.

Jason: Absolutely, what you have is a bunch of complex systems, often hundreds of thousands of different devices out there, that all need to function well for your services to run properly – for your customers, for your users, for your backend services. And they’re sending you information and you need to decide what to do with it.

The way that information is presented, and how much of it is presented, is critical. If you want to rapidly get ahead of the incident management cycle and detect something sooner, before it has a critical impact on an end user, then you have to be very fast with your OODA loop.

You have to get critical information packaged in the right way, contextualized in the right way. And you have to eliminate all the extraneous information from those modern APMs, NPMs, and all the rest of those monitoring systems. They’re very good at creating data that maybe isn’t relevant at the time. They can create almost “big data” volumes in terms of the events and the signals that they’re giving you. Very little of it is operationally relevant. We used to talk about the bulk of telemetry data that we had, back when I was running IT operations. We had this large bulk of available information, and our job was to get to the thin stream of operationally relevant information and present it quickly and in a timely way to our operations engineers.

Yoram: So, AIOps is not a certain specific capability or feature… It manifests itself in the end as features and capabilities, but the basic concept of AIOps is minimizing information to only the critical, relevant data needed at a certain critical time, to be presented to the operator so he can act as needed.

Jason: Absolutely. And it deduplicates, it removes the noise. It gives you a place to manage that data, normalize it, prepare it so that it can be presented. And then all of those intermediate steps, all of those administrative tasks that I would compare to flying aircraft: maintaining heading and altitude and air speed. It automates those. It automates the transfer of that topology data to your collaboration systems.
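To make the normalize-and-deduplicate step concrete, here is a minimal Python sketch. It is not how BigPanda or any particular AIOps product works; the field names (`host`/`hostname`, `check`/`metric`, `severity`/`level`) and the `Alert` shape are assumptions invented for illustration. It maps differently shaped monitoring events onto one common form, then collapses repeats into a single alert with a count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common shape for an alert; frozen so it can be used as a dict key."""
    host: str
    check: str
    severity: str


def normalize(raw: dict) -> Alert:
    """Map a raw event onto the common shape (source field names are hypothetical)."""
    return Alert(
        host=str(raw.get("host") or raw.get("hostname") or "unknown").lower(),
        check=str(raw.get("check") or raw.get("metric") or "unknown").lower(),
        severity=str(raw.get("severity") or raw.get("level") or "unknown").lower(),
    )


def deduplicate(raw_events: list[dict]) -> dict[Alert, int]:
    """Collapse repeated events into one alert each, keeping a repeat count."""
    counts: dict[Alert, int] = {}
    for raw in raw_events:
        alert = normalize(raw)
        counts[alert] = counts.get(alert, 0) + 1
    return counts


# Two tools report the same problem with different field names and casing.
events = [
    {"host": "DB01", "check": "cpu", "severity": "critical"},
    {"hostname": "db01", "metric": "CPU", "level": "CRITICAL"},
    {"host": "web07", "check": "disk", "severity": "warning"},
]
alerts = deduplicate(events)
```

Here three raw events become two alerts, and the operator sees the database CPU problem once rather than once per tool.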

Yoram: Anything that’s not critical, that can be done automatically, should not be done by a human. The human being should be “reserved” for the specifics of what he’s good at, what he needs to be able to do, what the machine cannot do.

Jason: Absolutely. Anything that is routine and repeated continually, you want to automate that because you’ve already dialled in that process or procedure or administrative task to a point where you don’t need to do that manually anymore. You know exactly what’s required.

I need to know with every alert that comes in, what service is it related to? What run book is related to that service or that particular alert, and what applications are underneath it, what hosts are underneath it, and what the network situation is.

Yoram: But I don’t need to know anything else below that, all the rest of the stuff…

Jason: Right. I don’t need to know what the metrics looked like leading up to that point or what all the variables involved were. And I definitely don’t need to manually transfer that information from, let’s say, a CMDB into an alert payload. I want that done automatically because I do that every single time and I do it the same way every single time. The same way a pilot needs to maintain altitude and airspeed every single time he goes up in the air.

So, you just automate those processes that are routine and well understood, and then the humans are left to focus on what’s really important. Hey, what’s the root cause? What is the impact? What is the urgency of this? What priority should I set? Those are human decisions, judgment calls.
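The CMDB-to-alert transfer described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real CMDB API: the `CMDB` dictionary, its keys, and the runbook path are all hypothetical stand-ins for whatever inventory system is actually in use.

```python
# Hypothetical in-memory stand-in for a CMDB, keyed by host name.
CMDB = {
    "db01": {
        "service": "checkout",
        "application": "orders-db",
        "runbook": "runbooks/checkout-db",
    },
}

# The context fields copied onto every alert.
CONTEXT_FIELDS = ("service", "application", "runbook")


def enrich(alert: dict, cmdb: dict) -> dict:
    """Return a copy of the alert with service context added from the CMDB.

    Hosts missing from the CMDB get "unknown" so the gap is visible
    to the operator instead of being silently dropped.
    """
    record = cmdb.get(alert.get("host", ""), {})
    enriched = dict(alert)  # leave the original alert untouched
    for field in CONTEXT_FIELDS:
        enriched[field] = record.get(field, "unknown")
    return enriched


alert = {"host": "db01", "check": "cpu", "severity": "critical"}
enriched = enrich(alert, CMDB)
```

The lookup runs the same way for every alert, which is exactly why it is a candidate for automation. The judgment calls on root cause, impact, and priority stay with the human.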

Yoram: I could actually go on talking about these topics, both my favourite subjects. But maybe we can close by asking what we can learn from the adoption process of fly-by-wire. It took quite some time, right? There was apprehension, there were glitches. What can we learn from how fly-by-wire was adopted that applies to AIOps and IT Ops? What knowledge can we implement here?

Jason: I would say there are really two things we can take and transfer into the world of IT operations from that fly-by-wire revolution.

Number one is you need trust, first and foremost. Your human operators need to understand what your automated system, what your AIOps system, is doing.

Pilots experienced this. When they first went to fly-by-wire, they did not trust it one little bit. And there is a history of crashes and mishaps associated with moving to that next generation of technology. They had to work the kinks out. And it took a long time to establish that trust. And really it was by training pilots on “here’s how it works” and presenting them with all of the information that was going on in the background in fly-by-wire systems, that built the trust for pilots. So, they finally were able to say: yes, I can still fly the aircraft with this, without the cables connected to the controls and without the steam gauges.

The second thing is that when you need to make fast, accurate decisions to prevent bad things from happening, less is more when it comes to information. You need the right information, at the right time and packaged in the right way so you can really act on it decisively and accurately.

Yoram: Well, that’s interesting, I think this is a great place to end. I want to thank you so much for this very enjoyable conversation.

Jason: Thanks a lot, Yoram. It was great talking to you again, and I’m sure we’ll catch up again soon.

Yoram: I’m sure, too. And if you want to learn more about the BigPanda platform or