I’m a big fan of historical TV dramas and last week I finished watching the stunning and shattering HBO TV miniseries about the 1986 Chernobyl disaster.
As a monitoring expert and a product manager, I have visited dozens of IT operations centers, control rooms and NOCs, so I couldn’t help but compare them to the Chernobyl control room scenes in the show.
Putting aside the intrigue, ethics and morals that played a key role in the tragedy, it was amazing to see that, although 33 years have passed and a nuclear power plant control room seems unrelated to IT operations, the problems its operators faced were, and still are, very similar to those in today’s NOCs.
Chernobyl Control Room. Photo: Wikimedia
First, there is a distinct similarity in organizational structure. In both settings, operations engineers are required to manage (monitor, control and optimize) production systems 24/7. They are supervised by an Operations/NOC Manager, and they escalate incidents to SMEs or SREs (L2s/L3s) when needed.
Next, and most striking: no matter how much time and effort go into designing and building these systems, it’s very hard to predict how they will behave over time, after they go live and become critical to the health of the business (or power plant).
Which leads me to my two most critical takeaways regarding IT operations:
1. As someone smart once said: “the truth is the light source of intelligence.” No matter the type of technology or how advanced it is, without a clear, data-based understanding of an incident you can’t surface its root cause, and remediation becomes difficult.
Operators need a tool like the BigPanda Autonomous Operations Platform to compress alerts into enriched incidents, surface the probable root cause and present the alerts coherently and accessibly, so that they can respond intelligently.
2. The second takeaway relates to the question that runs throughout the miniseries and is answered in the last episode: how could a fail-safe reactor, one that theoretically “could not explode”, ultimately explode?
Without spoiling the ending, I will just say that every machine, no matter how well it is designed and/or implemented, needs human monitoring and control, so that it can continuously adapt to its ever-changing environment.
In the same manner, while adding machine learning (ML) and AI into IT operations is crucial for success, it is not enough.
ML needs to be transparent, trustworthy and controllable – as is the case with BigPanda’s Open Box Machine Learning. BigPanda allows users to see the ML logic in plain English, test it and run what-if experiments, and add tribal knowledge that strengthens this logic before moving to production.
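To make both takeaways a little more concrete, here is a minimal Python sketch of what "compressing alerts into incidents" with a transparent, testable rule could look like. It is purely illustrative and is not BigPanda's algorithm or API: the `Alert`, `Incident` and `CorrelationRule` types, the `correlate` function, the sample alerts and the time-window heuristic are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    host: str
    check: str
    timestamp: int  # seconds since an arbitrary reference point


@dataclass
class Incident:
    host: str
    alerts: list = field(default_factory=list)


@dataclass
class CorrelationRule:
    """Hypothetical rule: one tunable parameter, explainable in one sentence."""
    window_seconds: int

    def describe(self) -> str:
        # The rule can state its own logic in plain English.
        return (
            "Group alerts from the same host when each one arrives within "
            f"{self.window_seconds} seconds of the previous one."
        )


def correlate(alerts, rule):
    """Fold a stream of alerts into incidents according to the rule."""
    incidents = []
    latest_for_host = {}  # host -> most recent incident seen for that host
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_for_host.get(alert.host)
        if (
            current is not None
            and alert.timestamp - current.alerts[-1].timestamp <= rule.window_seconds
        ):
            # Close enough to the previous alert on this host: same incident.
            current.alerts.append(alert)
        else:
            # Otherwise start a new incident for this host.
            current = Incident(host=alert.host, alerts=[alert])
            incidents.append(current)
            latest_for_host[alert.host] = current
    return incidents


# Made-up sample data for the what-if experiment below.
alerts = [
    Alert("db-1", "cpu_high", 0),
    Alert("db-1", "disk_latency", 40),
    Alert("web-1", "http_5xx", 50),
    Alert("db-1", "replication_lag", 200),
    Alert("web-1", "http_5xx", 70),
]

# What-if experiment: try two window sizes before settling on one.
for window in (60, 300):
    rule = CorrelationRule(window_seconds=window)
    incidents = correlate(alerts, rule)
    print(rule.describe())
    print(f"  {len(alerts)} alerts -> {len(incidents)} incidents")
```

With this sample data, the 60-second window folds the five alerts into three incidents and the 300-second window folds them into two. The point is not the heuristic itself but that an operator can read the rule, change it and see the effect before trusting it in production.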
While the consequences of the Chernobyl breakdown were tragic and more disastrous than IT outages can conceivably be, it was very interesting for me to see how similar operational issues can be across different domains. And with all that said, if you haven’t seen the TV series yet, I highly recommend it.
If you want to learn more about the BigPanda Autonomous Operations Platform, start by watching this short product overview video.