What’s an incident?
An incident is an issue triggered by unexpected behavior that has an adverse impact on people, processes, or things. Incidents are often symptoms of larger problems and can frequently be remediated by routine tasks (reboot… reconnect… restart).
Our goal in IT, however, isn’t getting credit for fixing issues we created… it’s managing healthy infrastructure that doesn’t suffer from a high volume of incidents. MTTR-driven ops management often mistakes a large number of quickly resolved incidents for a sign of a productive team, when in fact it more often indicates fragile infrastructure.
What does it mean to resolve one?
Resolving incidents is considered positive… when in fact resolving them the right way the first time is what should be valued. MTTR rewards turning red to green. Other metrics like MTBF (mean time between failures) are better indicators of infrastructure that remains consistently healthy.
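To make the contrast concrete, here is a minimal sketch that computes both metrics for two made-up environments. It assumes the common definition of MTBF as uptime divided by the number of failures. The function names, the 30-day window, and the incident counts are all hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def mttr(downtimes):
    """Mean time to resolve: total downtime divided by the number of incidents."""
    if not downtimes:
        return None  # no incidents, nothing to average
    return sum(downtimes, timedelta(0)) / len(downtimes)


def mtbf(observation_period, downtimes):
    """Mean time between failures: uptime divided by the number of failures."""
    if not downtimes:
        return None  # no failures observed in this window
    uptime = observation_period - sum(downtimes, timedelta(0))
    return uptime / len(downtimes)


# Hypothetical data for one 30-day month.
month = timedelta(days=30)
fragile = [timedelta(minutes=5)] * 60  # 60 incidents, each fixed in 5 minutes
steady = [timedelta(hours=2)] * 2      # 2 incidents, each fixed in 2 hours

for name, downtimes in (("fragile", fragile), ("steady", steady)):
    print(name, "MTTR:", mttr(downtimes), "MTBF:", mtbf(month, downtimes))
```

With these invented numbers the fragile environment wins on MTTR (5 minutes vs. 2 hours) yet fails roughly every 12 hours, while the steady one goes about 15 days between failures.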
Is it always better to resolve incidents quickly?
Measuring reduced downtime alone is the IT equivalent of dipping the pacifier in cognac. The kid stops crying quickly but dad (mom would *never* exercise such bad judgment) may end up in prison. Reward thoroughness. Reward quality. Reward service. Don’t reward the cognac solution.
So what is MTTR?
It’s the starting point for a discussion about operational excellence. Its value varies from organization to organization, and it’s one of many indicators of healthy process and infrastructure. It’s best calculated as the total time incidents spent in any state other than “resolved”, divided by the total number of incidents. Durations should come from machine timestamps in monitoring data (vs. operator-supplied status changes), and frequently reopened (or flapping) incidents should be treated as a single incident.
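The calculation described above can be sketched roughly as follows. This is one possible reading, not a reference implementation: the event tuple format, the state names, the function name, and the 30-minute flap window are all assumptions made for illustration. Incidents that are still open at the end of the data are left out of the average.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(events, flap_window=timedelta(minutes=30)):
    """Return MTTR as a timedelta, or None if nothing was resolved.

    events: iterable of (incident_key, machine_timestamp, state) tuples
    (a hypothetical format). Any state other than "resolved" counts as
    unresolved time. A reopening within flap_window of the previous
    resolution is merged into the same incident.
    """
    by_key = {}
    for key, ts, state in sorted(events, key=lambda e: e[1]):
        by_key.setdefault(key, []).append((ts, state))

    durations = []  # one entry per (merged) incident
    for history in by_key.values():
        opened_at = None      # start of the current unresolved period
        last_resolved = None  # end of the previous unresolved period
        for ts, state in history:
            if state != "resolved":
                if opened_at is None:
                    opened_at = ts
            elif opened_at is not None:
                period = ts - opened_at
                flapped = (last_resolved is not None
                           and opened_at - last_resolved <= flap_window)
                if flapped:
                    durations[-1] += period   # same incident, reopened
                else:
                    durations.append(period)  # a new incident
                last_resolved = ts
                opened_at = None

    if not durations:
        return None
    return sum(durations, timedelta(0)) / len(durations)


# Hypothetical monitoring events: db-1 flaps once, web-1 is a single outage.
day = datetime(2024, 1, 1)
events = [
    ("db-1", day.replace(hour=9, minute=0), "open"),
    ("db-1", day.replace(hour=9, minute=10), "resolved"),
    ("db-1", day.replace(hour=9, minute=15), "open"),
    ("db-1", day.replace(hour=9, minute=25), "resolved"),
    ("web-1", day.replace(hour=10, minute=0), "open"),
    ("web-1", day.replace(hour=10, minute=5), "acknowledged"),
    ("web-1", day.replace(hour=10, minute=40), "resolved"),
]

print(mean_time_to_resolve(events))  # 0:30:00
```

Counting the db-1 flap as two separate incidents (for example by passing `flap_window=timedelta(0)`) would lower the result to 20 minutes, which is exactly the kind of flattering number that rewards turning red to green.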
Consider this less an unprovoked assault on IT doctrine and more an invitation to spend 30 minutes with your team evaluating whether or not MTTR reduction is the metric best aligned with business value.