Is ITIL Dead or Alive? Here’s the CTO Perspective
Our CTO and Field CTOs spend a substantial amount of their time talking to customers, industry leaders and domain experts, gaining extraordinary insights and subject matter expertise on the most current and burning issues in IT Operations. Listening to them speak at events, webinars, prospect meetings and even as part of random watercooler chats, we always thought it was a shame we didn’t have a platform to share their wisdom on a broader scale.
That’s why we decided to create “The CTO Perspective” video interview series.
In each of these short online interviews, we’ll discuss a top-of-mind topic in IT Operations, get the perspective of our CTO or one of our Field CTOs on that topic, and share it with you in what we hope is an interesting and insightful format.
In this interview – our second one in the series – we talk to BigPanda’s CTO Elik Eizenberg about the role of ITSM and ITIL in today’s hybrid cloud environments, or as we called the interview: “ITIL – Dead or Alive?”.
Lean back and watch the interview, or if you prefer reading, take a few minutes to read the transcript. It’s been lightly edited for readability. Enjoy!
Yoram: Hello and welcome to The CTO Perspective, where we discuss unique perspectives about the most current topics in IT Operations. And here today with us is Elik Eizenberg, CTO and co-founder at BigPanda. Hi, Elik.
Elik: Hi, Yoram. Good to be here again.
Yoram: Great to have you. Always a pleasure talking to you. Today, we’re going to be talking about a specific subject in hybrid cloud and the enterprise transformation, and that is the role of ITSM and ITIL. So let’s dive directly in. Would it be fair to say that ITSM and ITIL evoke strong emotions in IT Ops professionals on both sides of the spectrum? Some live by it, some maybe hate it, but nobody is really indifferent?
Elik: Yeah, a hundred percent. I would say this: ITIL and ITSM, as a philosophy, as a collection of processes and tools, have probably been one of the biggest and best things that ever happened to IT Operations and service management. They started about 20, 25 years ago and really introduced processes that allowed large companies and enterprises to run their operations in a way that’s responsible, in a way that provides good quality of service, and allows the business to move fairly fast.
But, 20, 25 years is a long time and a lot has changed. The people have changed, the tools have changed. And more importantly, how we run applications and infrastructure has changed: we moved to the cloud, we’re using CI/CD pipelines, we’re doing containers. All of that has changed how we run applications. And that means that ITSM is now not necessarily the best fit for these environments in its current form.
I’ve spoken to IT veterans who’ve been managing data centers for 20, 30 years, and they clearly say that ITIL is the best thing they ever learned and implemented in their stacks. And then I’ve spoken to younger people, five years in the industry, and they say ITIL is the worst thing that ever happened in the IT operations space. And they would never touch anything that even resembles ITIL methodology. So absolutely a very polarizing framework right now.
Yoram: Yeah, and somewhat confusing, I would say. There are different stories out there of successes or failures of using ITIL and ITSM. You told me a story about an online retailer who had this really good ITSM foundation for their stores. And then things started to go awry when they wanted to offer digital services and online services.
Elik: Yeah, I think it’s a great example of how ITIL has been so successful for many companies, but now they’re running a little bit into a wall with this ITIL methodology. This is one of the biggest retailers in all of the U.S., and they have been using an ITSM toolset and an ITIL methodology for their in-store infrastructure. For their stores, they have point of sale systems, AC systems, electricity, logistics, all of that. And they have a CMDB, and Change Management, Problem Management, Incident Management – the entire suite of ITIL processes to manage this in-store infrastructure. And this has been very successful for them. It’s a fairly slow moving environment, very monolithic, and works really well. But now recently they’ve started migrating. In the last five years or so, they started migrating to the cloud to run their digital footprint. They have mobile apps. They have an e-commerce channel. And all of that is implemented in a cloud-native fashion, using continuous delivery methodology.
I was speaking to the head of IT Operations there and he said, “You know, we were really successful with ITIL in our in-store infrastructure. So we tried to translate that to the e-commerce channel, to the online digital footprint that we have, and just failed miserably.” He told me, “We’re giving up and we need an entirely new methodology to be able to bring order to the chaos of our modern infrastructure.”
Yoram: So they’re probably not proponents of ITIL and ITSM… But what about the other example, on the other side of the spectrum, that you told me about, a big financial software developer?
Elik: Yeah, this one is a very interesting story. It’s a Silicon Valley company, very successful, developing a SaaS finance product. Obviously their staff is mostly people with 5 to 10 years in the industry. And they decided one day they were going to migrate to an “all-in” DevOps SRE methodology. They were going to do Continuous Delivery for everything. They literally had a sign on the wall saying “ITIL does not enter this building”. And you know, six months into this – what I would call an “experiment” – they found that quality of service actually went down, and MTTR went up. There was real degradation in the quality of their service, and they had to roll back to more traditional ITIL processes to be able to maintain the complexity of their infrastructure. So it’s very interesting to see how many companies that go to that extreme, actually roll back, and backtrack to more traditional processes.
Yoram: So you told us two stories. One about a company that tried to maintain its ITIL principles while moving to the cloud. That didn’t really succeed. And then another company that was trying to run away from ITIL and go to a DevOps SRE model, and that didn’t succeed either. So which one is it? It’s really confusing. On the one hand, you don’t want to stay behind. You want to innovate. But on the other hand, if you try to turn on a dime and then “run away”, so to speak, from ITIL, you’re likely to fail as well. So what do you do?
Elik: You know, the way I think about it is this: the extremist response, which is kind of like developing an allergy to anything ITIL and ITSM, is not the right way to go. But there is a lot of merit to the ones saying that ITIL has to be adapted to fit the times. So the solution is really about taking the really good (and established) principles and processes that ITIL brings to the table, but then changing how you implement them to fit the modern way infrastructure is being operated and implemented.
Yoram: Ok, so it’s not that ITSM and ITIL are “dead” or “alive”, so to speak, but they’re sort of being reborn. So what would you say, for example, if we look at the CMDB, which is always something for conversation… CMDB in ITSM, what does that represent and how do we do that in this new world?
Elik: The CMDB is one of the most important aspects of the ITIL methodology. It’s a database of all your assets and all the dependencies between your assets. And that really helps you drive things like impact analysis: whenever something happens, you want to understand which business services are being affected. So the idea of the CMDB is definitely the right one. And every company, whether modern or traditional, needs a CMDB.
The challenge with the traditional way of doing things is that it relies on manual human entries. So people are actually populating the database. Or sometimes it’s driven by auto-discovery jobs that run once a week or once a month. But, in a modern environment that’s cloud driven or container driven, things change by the minute, sometimes by the second. You have new servers coming up or going out every single minute, and you have new applications being deployed every single minute. And obviously, you cannot rely on human driven processes or auto discovery jobs.
So how do you solve that? Essentially, instead of relying on a single source of truth, like in the traditional approach, you piece together your topology view by combining different data sources. I’ll give you a couple of examples: you can take Change Management information from tools like Chef, Ansible and Puppet. Take that information as one source. Then you can take service maps from application performance monitoring (APM) tools like AppDynamics and Dynatrace, which have good service maps for application-to-server dependencies. You take that feed as well. Finally, another source could be virtualization environments. If you have a private cloud running in your vCenter in vSphere, that can give you the dependencies between physical infrastructure and virtual machines. So you take all these different signals, you combine them, and then you get a very good view of your topology and your service dependencies. But it’s real time, it changes with your infrastructure and doesn’t rely on a single source of truth. So the same idea of a CMDB, but implemented very differently.
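To make the idea concrete, here is a minimal Python sketch of piecing a topology together from several feeds and using it for impact analysis. Everything in it is an assumption for illustration: the `Edge` record, the `build_topology` and `impacted_services` helpers, and the host and service names are hypothetical, and the hard-coded lists stand in for whatever a real APM tool, virtualization layer or configuration tool would export. It is not how any particular product does it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One dependency: `parent` depends on `child`, as reported by `source`."""
    parent: str
    child: str
    source: str


def build_topology(*feeds):
    """Merge dependency edges from several feeds into one graph.

    No feed is treated as the single source of truth; the graph is the
    union of everything the feeds report.
    """
    graph = defaultdict(set)
    for feed in feeds:
        for edge in feed:
            graph[edge.parent].add(edge.child)
    return graph


def impacted_services(graph, failed_node, services):
    """Return the services that depend, directly or indirectly, on failed_node."""
    impacted = set()
    for service in services:
        stack, seen = [service], set()
        while stack:
            node = stack.pop()
            if node == failed_node:
                impacted.add(service)
                break
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return impacted


# Hypothetical sample feeds (hard-coded stand-ins for real exports).
apm_feed = [
    Edge("checkout", "web-vm-1", "apm"),
    Edge("search", "web-vm-2", "apm"),
]
virtualization_feed = [
    Edge("web-vm-1", "esx-host-a", "virtualization"),
    Edge("web-vm-2", "esx-host-b", "virtualization"),
    Edge("web-vm-3", "esx-host-b", "virtualization"),
]
config_feed = [
    Edge("checkout", "web-vm-3", "config_mgmt"),
]

graph = build_topology(apm_feed, virtualization_feed, config_feed)
print(impacted_services(graph, "esx-host-b", ["checkout", "search"]))
```

In this sketch, the link from `checkout` to `esx-host-b` only exists because the configuration feed and the virtualization feed are combined; neither feed alone would reveal it. Keeping the view real time would mean rebuilding or incrementally updating the graph whenever a feed changes, which is left out here.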
Yoram: So no single source of truth, and another point that was sort of hidden in what you were saying: people don’t change the data. You have to automate it. You can’t rely on people to collect it manually. OK. And taking those principles, what about Change Management? What does that represent in ITSM and how do we do it differently in this new world?
Elik: So Change Management has two roles, right? One role is around making sure that you introduce changes in a safe fashion, that doesn’t break your applications, as much as possible. And the second one is providing visibility around all the changes in your organization in case something does break. So you know what has changed and might have caused your infrastructure to break. In the modern environment, obviously, we’re not going to go through manual ticket-based processes to approve changes. Everyone wants to use fast delivery methods like CI/CD pipelines.
So what do you do? You essentially use CI/CD pipelines for testing, to take care of the first part, around not breaking things. You use the automated testing aspect for that. For visibility, that’s where it becomes interesting. What you essentially do is create a registry of changes, that again, pulls into the system different sources of changes. You connect to your orchestration tools – Ansible, Terraform, Kubernetes – and look at all the change feeds. You consume them into the central change registry. Then you look into your CI/CD tools – Jenkins, Bamboo CI, others – and take in all the change feeds and put them inside the same change registry. You’ll look at cloud audit tools, like (AWS) CloudTrail, for example, and look at all the changes in your cloud. So now you have one centralized view of all the changes. And then you use technologies like Machine Learning to easily scan that large feed of change data to understand what changed, whenever you have an incident. So, again, the same idea of Change Management, just implemented very differently for the modern infrastructure.
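A central change registry of this kind can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the `Change` record, the `ChangeRegistry` class, its `suspects` method, and all sample events and timestamps are hypothetical, and the feeds are hard-coded rather than pulled from real orchestration, CI/CD or cloud audit tools. The time-window filter is a deliberately simple stand-in for the Machine Learning step mentioned above, not an implementation of it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A normalized change event, regardless of which tool reported it."""
    timestamp: datetime
    source: str   # e.g. "ci_cd", "orchestration", "cloud_audit"
    target: str   # the host or service that was changed
    summary: str


class ChangeRegistry:
    """Collects change events from many feeds into one time-ordered list."""

    def __init__(self):
        self._changes = []

    def ingest(self, changes):
        """Add a batch of changes from any feed and keep the list sorted."""
        self._changes.extend(changes)
        self._changes.sort(key=lambda change: change.timestamp)

    def suspects(self, incident_time, targets, window=timedelta(hours=1)):
        """Return changes to `targets` in the window before the incident,
        most recent first. A simple heuristic, not Machine Learning."""
        start = incident_time - window
        hits = [
            change for change in self._changes
            if start <= change.timestamp <= incident_time
            and change.target in targets
        ]
        return sorted(hits, key=lambda change: change.timestamp, reverse=True)


# Hypothetical sample events (hard-coded stand-ins for real change feeds).
registry = ChangeRegistry()
registry.ingest([
    Change(datetime(2024, 1, 1, 11, 50), "ci_cd", "checkout", "deploy build 128"),
    Change(datetime(2024, 1, 1, 11, 55), "orchestration", "search", "config push"),
])
registry.ingest([
    Change(datetime(2024, 1, 1, 11, 30), "orchestration", "web-vm-2", "resize instance"),
    Change(datetime(2024, 1, 1, 9, 0), "cloud_audit", "checkout", "security group edit"),
])

incident_time = datetime(2024, 1, 1, 12, 0)
for change in registry.suspects(incident_time, {"checkout", "web-vm-2"}):
    print(change.timestamp, change.source, change.target, change.summary)
```

The list of targets passed to `suspects` would typically come from the topology view: the affected service plus everything it depends on. At real scale, ranking the candidates is where a statistical or Machine Learning approach would replace the fixed one-hour window used here.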
Yoram: Ok. So if I were to summarize: First of all, collect all the change feeds and do that automatically. So once again, there’s no single source of truth, just as with the CMDB, only this time applied to changes. And also do it automatically. And the second thing that you introduced is: use Machine Learning to correlate and understand, right?
Elik: Exactly. At the scale of modern data, you cannot rely on manually sifting through information; you have to rely on Machine Learning to identify the records that you should care about.
Yoram: So basically, three main guidelines in doing ITSM in the new world or this new combination of ITSM in the DevOps SRE model: (1) No single source of truth, (2) data should not be reliant on people updating it, and (3) use ML to correlate and analyze.
Elik: I could not have said it better…
Yoram: So maybe this would be a good place to say a few words about BigPanda, which takes all the topics that we just discussed and implements them into an AIOps platform.
Elik: Yeah, absolutely. BigPanda is an AIOps platform that was architected from day one to support hybrid infrastructure. You have your traditional infrastructure, you have your modern infrastructure – and BigPanda essentially consumes change, topology and alert data from both these types of infrastructure. And then it enables the workflows we just discussed, across these two types of environments.
Yoram: Ok, I think we’ll let the viewers try that on their own. Thank you so much for talking to me today!
Elik: I really enjoyed this. Thank you, Yoram.
If you want to watch or read more CTO perspectives or learn about the BigPanda platform, please visit BigPanda.io.