It’s well known in IT operations that things rarely break on their own: close to 80% of production outages are caused by changes made by developers or IT staff. Yet this fact often eludes us when it comes time to actually resolve production issues.
The last ten years have brought enormous changes to production environments, driven by a best-of-breed approach to production infrastructure enabled by open source and cloud. This has been a boon for developers in terms of flexibility and productivity, but it’s also placed a new set of challenges and expectations on Ops.
In many ways, incident management for devops is similar to typical issue tracking processes: it facilitates coordination and collaboration on daily tasks. For this reason, tools such as Jira, Zendesk, and even email are often used as solutions for incident management. But incident management faces one unique challenge that sets it apart from other issue tracking processes. In addition to human-operated workflows, incident management also relies heavily on machine-driven workflows. Unfortunately, traditional issue trackers and ticketing systems cannot accommodate this with their current product mechanics.
Few things damage productivity as much as waiting. Waiting forces us to context switch, disrupts our creative momentum and eliminates our ability to experiment. Whether we are deploying a new service or troubleshooting a problem, waiting puts a heavy tax on efficient work.
We engineers love measuring stuff. Whether it helps us solve an immediate problem, gets us ready for a bad day, or simply because most of us are information junkies, we love keeping track of metrics. The spectrum of what can be measured is very wide. It can include data from every part of our system: from technical metrics such as disk space or requests per minute (RPM), through UI metrics like page load times, to business KPIs such as revenue, conversion rates and so on. When choosing which metrics to collect, we usually start with the obvious ones: those that reflect the current state of the system (e.g., CPU, memory and load). There are quite a few articles and blog posts about these metrics, so I’m not going to discuss them here. Rather, I would like to focus on metrics that reflect the user experience.
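To make the "current state of the system" metrics concrete, here is a minimal sketch of collecting a couple of them with only the Python standard library. The function name and the dictionary keys are illustrative assumptions, not part of any particular monitoring product; a real setup would ship these values to a metrics backend rather than return them.

```python
import os
import shutil

def system_state_metrics():
    """Sample a few basic system-state metrics (Unix-like systems only).

    Illustrative sketch: load average stands in for CPU pressure,
    and root-partition free space stands in for disk capacity.
    """
    load_1m, load_5m, load_15m = os.getloadavg()   # 1/5/15-minute load averages
    disk = shutil.disk_usage("/")                  # total/used/free bytes on "/"
    return {
        "load_1m": load_1m,
        "disk_free_pct": 100.0 * disk.free / disk.total,
    }

metrics = system_state_metrics()
```

In practice you would sample these on a schedule and graph them over time; the point is that system-state metrics are cheap to get, which is exactly why most teams start (and too often stop) there.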
Here are the four metrics that we at BigPanda see as the most important in this category: