Incident Management KPIs

Incident Management KPIs that really matter and how to improve them effortlessly

Tracking IT incident management key performance indicators (KPIs) is a vital step toward minimizing disruptions for customers and users. But metrics alone are not enough to drive improvement in incident management. Learn what KPIs to monitor and how to use them.

What Is a KPI in Incident Management?

Key performance indicators (KPIs) show whether a company is improving how it handles incidents, which are IT issues and disruptions. Incident duration, number of incidents, and average time to resolve are important metrics. By tracking KPIs, a business can see if it is moving toward its incident management goals. Incidents include hardware and software problems as well as quality issues. Engineers focus on incidents that interfere with IT services and the availability of the IT system to business users and customers. Their goal is to minimize the impact on users and get services back to normal as soon as possible.

Businesses typically monitor KPIs closely. Changes show whether a team is making progress on an important goal.

What’s the Difference Between Metrics and KPIs?

While often used interchangeably, KPIs and metrics are different. KPIs are a type of metric that measures performance against critical business objectives. Metrics are also measurements, but they aren’t strategic. Metrics can be simple status and process indicators.

What Are the Most Important KPIs of Incident Management?

Top incident management KPIs are metrics that show impact on users. When a system is down, employees and customers cannot use it, and the business loses money. For this reason, KPIs focus on measuring the severity of impact and speed of recovery.

Top Incident Management KPIs

Impact Duration: The time between when an issue arises (possibly without anyone noticing) and when the incident ends. This KPI is the most important because it affects customer and user experience. For major incidents, the impact duration affects the availability of service. Some organizations look at impact minutes, a measure tied to the number of users. For example, if a global online game with an average of 2 million players online at once suffers a two-minute outage, that would be 4 million impact minutes. This metric makes the impact on user experience much more concrete.

Incident Duration: The time span between when the service team creates a ticket and when the incident ends and the ticket is resolved. This KPI represents the effectiveness of your incident management process. Duration reflects how much of your service resources you are using and can be the key indicator in service-level agreements (SLAs).

Mean Time to Detect/Discover (MTTD): The average time between the onset of an issue and its detection. This number shows how long it takes a team to recognize that there is a problem.

Mean Time to Acknowledge (MTTA): The average time between an alert and the service team acknowledging it.

Mean Time to Action (MTTA): The average time from an alert to when the first action to fix it takes place. This KPI measures how quickly engineers see and respond to an incident.

Mean Time to Assign (MTTA): The average time from receiving an alert to assigning it to service staff. The KPI is typically the same as mean time to action. This metric indicates how long it takes to identify the proper owner of the resolution process, the team or person best equipped to deal with the issue.

Mean Time to Know or Diagnose (MTTK): The average time between an alert and when you know the cause of the issue.

Mean Time to Recover/Repair/Resolve (MTTR): The average time from receiving an alert to recovery—meaning the system is functioning again.

Mean Time to Remediate: More commonly used in IT security, this measures the average time between discovering a threat or vulnerability and closing the ticket.

Incident Volume: Use this KPI to measure how many incidents occur in a given period, such as a week, month, or year.

Uptime: The percentage of time that an application or system is available to users and fully functioning.
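As a minimal sketch of how the timeline KPIs above could be computed, the following Python example derives MTTD, MTTA (acknowledge), MTTR, and impact minutes from incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical, and the alert time is assumed to be the detection time; a real ticketing system will expose different fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started_at: datetime       # onset of the issue (often known only in hindsight)
    alerted_at: datetime       # alert raised, treated here as the detection time
    acknowledged_at: datetime  # service team acknowledges the alert
    resolved_at: datetime      # system is functioning again


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttd(incidents):
    return mean_minutes([i.alerted_at - i.started_at for i in incidents])


def mtta(incidents):
    return mean_minutes([i.acknowledged_at - i.alerted_at for i in incidents])


def mttr(incidents):
    return mean_minutes([i.resolved_at - i.alerted_at for i in incidents])


def impact_minutes(affected_users, outage_minutes):
    """Impact minutes, assuming a constant number of affected users."""
    return affected_users * outage_minutes


def minutes(n):
    return timedelta(minutes=n)


t0 = datetime(2024, 1, 1, 9, 0)  # made-up sample data
incidents = [
    Incident(t0, t0 + minutes(4), t0 + minutes(6), t0 + minutes(34)),
    Incident(t0, t0 + minutes(6), t0 + minutes(10), t0 + minutes(56)),
]
print(mttd(incidents), mtta(incidents), mttr(incidents))  # 5.0 3.0 40.0
print(impact_minutes(2_000_000, 2))  # 4000000, the online game example
```

Measuring MTTR from the alert rather than from the onset follows the definition given in the list; teams that measure from onset would swap `alerted_at` for `started_at`.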

Figure: Incident Management Timeline and MTTX

More Incident Management KPI Examples

Metrics help assess incident management and the health of the IT infrastructure. Here are examples of some more KPIs that service teams watch.

Alerts Created: The number of automated notifications generated by monitoring tools when certain conditions are present or a threshold, such as CPU utilization, is crossed.

Resolution within SLA: This measures the percentage of incidents fixed within the service level agreement.

Timestamp: The hour, minute, and second that an alert happened, or a user reported a problem.

Incident Count: How many incidents occurred in total over a specified period. A team might look at incidents of a specific type or related to a piece of hardware.

Incident Frequency: The rate of incidents.

Mean Time Between Failure (MTBF): The average time between outages.

Incidents Resolved on First Contact: The percentage of incidents resolved when the service desk has initial contact with a user, or a support team member touches the problem for the first time.

Number of Active Tickets: How many service tickets are in the IT service desk’s workflow between identification and closure.

Compression Rate: Use this metric to measure a team’s ability to tie multiple incidents to the same cause, so engineers treat the cluster as one incident. Tracking this number can increase efficiency.

Number of Repeated Incidents: Number or percentage of incidents that are a recurrence of an issue.

Reopen Rate: The rate at which tickets have to be reopened for further resolution.

Incident by Type: Incident counts according to categories. These include:

  • Major, moderate, or minor in terms of user impact
  • Service request, fault, automatic alert, upgrade
  • Repetitive problems that automated scripts and simple user instructions can resolve vs. repetitive problems caused by IT configuration or infrastructure
  • Levels according to problem complexity

Incident Backlog: Incidents that are waiting to receive attention.

Escalated Incidents: The number or frequency of incidents that are too difficult for the first responder to resolve.

Incidents with No Known Resolution: A measurement of incidents that have no solution identified.

Average Cost to Resolve: This number is the cost to resolve all incidents divided by the number of incidents. Use this metric to see the cost of the service operation.

Incident Management Customer Satisfaction: The average user rating of incident management service.
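Several of the KPIs in this list are simple percentages or averages over closed tickets. The sketch below shows one way to compute four of them; the ticket fields (`resolve_minutes`, `sla_minutes`, `first_contact`, `reopened`, `cost`) and the sample values are hypothetical, not taken from any particular service desk tool.

```python
# Hypothetical closed tickets; field names are illustrative only.
tickets = [
    {"resolve_minutes": 30, "sla_minutes": 60, "first_contact": True, "reopened": False, "cost": 20.0},
    {"resolve_minutes": 90, "sla_minutes": 60, "first_contact": False, "reopened": True, "cost": 80.0},
    {"resolve_minutes": 45, "sla_minutes": 60, "first_contact": True, "reopened": False, "cost": 35.0},
    {"resolve_minutes": 55, "sla_minutes": 60, "first_contact": False, "reopened": False, "cost": 65.0},
]


def percent(items, predicate):
    """Percentage of items for which predicate(item) is true."""
    return 100 * sum(1 for item in items if predicate(item)) / len(items)


resolution_within_sla = percent(tickets, lambda t: t["resolve_minutes"] <= t["sla_minutes"])
resolved_on_first_contact = percent(tickets, lambda t: t["first_contact"])
reopen_rate = percent(tickets, lambda t: t["reopened"])
average_cost_to_resolve = sum(t["cost"] for t in tickets) / len(tickets)

print(resolution_within_sla)      # 75.0
print(resolved_on_first_contact)  # 50.0
print(reopen_rate)                # 25.0
print(average_cost_to_resolve)    # 50.0
```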

Setting Expectations in Incident Management

Incident management plays out against baseline expectations agreed to by a business and its customer or between the IT team and its users in the company. The components of these expectations are the following:

Service Level Agreement (SLA): An agreement in which the service provider promises the user certain performance. The SLA defines penalties or consequences for not delivering the results, such as a cost reduction for the customer. The user might have responsibilities, too, such as reporting issues promptly. The IT Ops team may monitor the number of incidents that breach SLAs.

Service Level Objective (SLO): SLOs are part of the SLA. These service objectives are the metric levels for determining if the SLA was achieved. Ideally, SLOs should be straightforward and easy to track. Common measures are uptime (how much of the time an application or service is operational), latency, throughput, and MTTR.

Service Level Indicator (SLI): The SLI measures compliance with SLOs. If your SLO is an uptime of 99.95 percent and you achieved 99.96 percent, your SLI is 100 percent.

A commonly cited goal for uptime is “five nines,” or 99.999 percent, which is about 5.3 minutes of downtime a year. A more common real-world SLA for uptime is 99.95 percent.
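To make uptime targets concrete, the arithmetic can be sketched as follows. The function name `allowed_downtime_minutes` is hypothetical, and the calculation assumes a 365.25-day year; other year-length conventions shift the result slightly.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming a 365.25-day year


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an uptime target over a given period."""
    return period_minutes * (1 - uptime_percent / 100)


for target in (99.999, 99.95):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime a year")
```

Under these assumptions, the gap between the two targets is large: the 99.95 percent budget works out to roughly 4.4 hours a year, about 50 times the five-nines budget.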

How to Choose Which Incidents to Measure

Networks typically generate a deluge of system indicators. IT service managers can improve their operations by focusing on meaningful incidents and the best KPIs. These incidents require action and have relevant, reliable, and real-time data.

Incidents that are meaningful to measure are:

  • Actionable: The incident requires action to prevent some level of business impact.
  • Comprehensive: You have all causal and symptomatic alerts that stem from the root cause.
  • Contextualized: Information to diagnose and remediate the incident is provided across teams and topologies, including diagnostic indicators and documentation.

These incidents exclude alerts that do not signal a service or hardware problem, such as configuration changes and security updates. The right KPIs for measuring these incidents are relevant to the issue, reliable in that they accurately capture it, and available in real time.

Benefits of Tracking Incident Management KPIs

Tracking incident management KPIs helps IT service team managers improve their operations. They use them to highlight problems with a company’s IT infrastructure.

Incident management KPIs can show which part of the resolution process is weak. For example, suppose users are complaining about the duration of outages. The service manager discovers that the team is quick to diagnose and resolve issues. However, a cumbersome process for logging and prioritizing tickets means it takes too long to assign a service tech. That tells the manager which part of the incident management process needs special attention.

An analysis of tickets can reveal correlations between issues involving the same configuration, priority, user group, dependencies, tech assigned, service involved, location, or other factors. Then you can look at these holistically and design changes.

KPIs can benchmark the team’s performance and provide concrete goals for improvement. Tracking KPIs is motivating and makes the service desk more efficient. Better productivity leads to higher user satisfaction, increases the reliability of systems, and helps the business achieve its objectives. Furthermore, KPIs can shine a spotlight on hardware that needs replacement or configuration issues when incidents are categorized.

Challenges of Incident Management Metrics

There is a strong case for monitoring incident management KPIs, but it is vital to be aware of their limitations. Gathering too much data makes it hard to spot what is important. Moreover, KPIs only shed light on outcomes, not causes.

Correctly diagnosing incident management problems becomes harder when there may be more than one cause or interplay among variables in the system. This complexity requires a broader problem-solving mindset, a willingness to question assumptions, and tools that leverage machine learning.

MTTX (mean time) metrics can give a distorted view. Imagine your company does everything perfectly, so months go by without issues. Then there’s a major incident, only one during the year, and it takes two business days to resolve. Your MTTR is 16 hours. In comparison, an average corporation faces a steady flow of small and medium issues and some significant incidents. That company’s MTTR might be two hours because of the volume of incidents. Yet the organization that can go 11 months without any incidents is surely managing its IT environment better.
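The distortion is easy to reproduce with made-up numbers. In this sketch, the resolution times for both companies are hypothetical and chosen only to match the 16-hour and two-hour averages described above.

```python
from statistics import mean

# Hypothetical resolution times in hours over one year.
company_a = [16]                           # a single major incident (two 8-hour business days)
company_b = [1] * 30 + [2] * 15 + [8] * 5  # a steady flow of 50 incidents

for name, hours in (("A", company_a), ("B", company_b)):
    print(f"Company {name}: MTTR {mean(hours):.1f} h, "
          f"{len(hours)} incidents, {sum(hours)} h total")
```

Company B has the better MTTR (2 hours against 16) but spent 100 hours resolving incidents, compared with 16 hours for company A. Reading MTTR alongside incident volume or total resolution time gives a fairer picture.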

Another limitation is that KPIs do not reflect context. For example, the MTTX could be the same for an outage that occurs while a company is racing to meet a big order deadline and one that happens on a holiday when almost no one is using the service. While the MTTX is the same, the business impact of the two outages is enormously different.

Similarly, as you improve incident management, the changes may show up in KPIs in unanticipated ways. Suppose you employ artificial intelligence (AI) to automate self-healing repair for a stream of easy-to-fix incidents. Removing that source of incidents from your data will make your mean time to resolve rise, even though the operation has become more efficient.
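A small numerical sketch shows the effect. The incident counts and durations are hypothetical, and the example assumes that automated incidents drop out of the data entirely and consume no engineer time.

```python
from statistics import mean

# Hypothetical resolution times in hours.
easy_incidents = [0.5] * 80  # easy-to-fix incidents, later handled by automation
hard_incidents = [4.0] * 20  # incidents that still need an engineer

mttr_before = mean(easy_incidents + hard_incidents)
mttr_after = mean(hard_incidents)  # easy incidents no longer appear in the data
engineer_hours_before = sum(easy_incidents + hard_incidents)
engineer_hours_after = sum(hard_incidents)

print(f"MTTR: {mttr_before:.1f} h before, {mttr_after:.1f} h after")
print(f"Engineer hours: {engineer_hours_before:.0f} before, {engineer_hours_after:.0f} after")
```

With these numbers, mean time to resolve rises from 1.2 to 4.0 hours while the hours engineers spend on incidents fall from 120 to 80.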

Incident Management Metrics Dashboard

A KPI dashboard for incident management illustrates key metrics in one view. The dashboard should continuously update so that you can see at a glance the status of incident management in real-time.

Common KPIs for a dashboard include the number of open incidents, average age of open incidents, average incident duration, average impact duration, MTTD, MTTA, and MTTR.
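As one possible sketch, two of these dashboard figures (number of open incidents and average age of open incidents) can be derived from the opening times of unresolved tickets. The function name `open_incident_stats` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def open_incident_stats(opened_times, now):
    """Return the count of open incidents and their average age in hours."""
    ages = [(now - opened).total_seconds() / 3600 for opened in opened_times]
    average_age = sum(ages) / len(ages) if ages else 0.0
    return {"open_incidents": len(ages), "avg_age_hours": average_age}


now = datetime(2024, 1, 1, 12, 0)  # a dashboard would use the current time instead
opened = [now - timedelta(hours=1), now - timedelta(hours=3)]
print(open_incident_stats(opened, now))  # {'open_incidents': 2, 'avg_age_hours': 2.0}
```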

You can add any other metric that is relevant to your situation. For example, if your team has focused on improving one part of its workflow, pick a KPI that will reflect performance changes in this process.

Incident Management KPI Dashboard

Example: For a dashboard to have value, you must have reliable data to feed KPI calculations in real time. Here is one dashboard example that provides a live view of incoming incidents and their origin, status, and how quickly the team is closing tickets.

Figure: Incident Management KPI Dashboard

Incident Management KPI Cheat Sheet

Many of the same incident management KPIs appear in reports and dashboards, such as MTTX and cost per incident. An incident management dashboard typically updates continuously in real-time, while a report may look at performance over various periods. KPI reports may even be interactive and allow the user to model potential changes using what-if scenarios. Here is a cheat sheet of the most used KPIs.

Figure: Incident Management KPI Cheat Sheet

This is a handy tool you can use to start identifying KPIs to track.

How Do Your Incident Management KPIs Compare?

While a team’s trend over time on incident management KPIs is important, companies also want to know how their performance compares to their peers. We should note that variables such as prioritization system, geographic spread, and lead time influence KPIs.

Figure: Common Goals and Averages for Incident Management KPIs

Understanding the Incident Management Framework

Incident management processes are largely standardized under the ITIL (IT Infrastructure Library) framework. Companies that manage their computing, applications, and networks this way use the ITSM (IT Service Management) model under ITIL.

In this model, the IT service desk acts as a single point of contact with users about disruptions. The service team follows best practices for incident management to resolve issues, which can include anything from a network outage to a printer not working.

In younger companies, agile organizations, and those embracing DevOps (where development and operations teams merge), the process is more fluid. The group may tailor the incident response to the nature of the problem, often with a cross-functional team.

Incident Management Process Flow

Under ITSM, a series of steps begins when the incident arises and carries through to resolution. Many businesses have distinct workflows for minor and major incidents.

Incident Identification and Logging: The first news of the problem could be an automated alert, such as an alarm from an application that monitors network status. Users may phone, chat, or use a self-service portal to report the problem. The service desk then logs the incident and creates a ticket that includes details of the disruption and the time. The staff categorizes the incident, which may be by the service involved or the service level agreement that covers the application.

Incident Prioritization and Escalation: The service team also prioritizes the incident by severity, how many users are affected, and how disruptive the problem is to core business activities. The system may manage and self-repair minor incidents. Helpdesk staff may work with users remotely to solve low-priority issues. In those scenarios, the incident doesn’t make it to this step.

Incident Investigation and Diagnosis: For more widespread and urgent problems, assigned technicians begin to research the cause and come up with potential solutions. The team diagnoses the problem and determines the correct resolution. The service desk usually also notifies users about the incident and expected time to resolve it.

Incident Resolution and Recovery: The service engineer fixes the underlying cause and restores service to users. Testing may occur, too.

Incident Closure: The ticket is updated with action taken and any information gleaned. This step may prompt an update to users on how to avoid the problem. The service desk closes the ticket.

How AI and Machine Learning Are Changing Incident Management

Artificial intelligence and machine learning are reshaping computing and powering advances in incident management. Among the gains are automating formerly manual processes and applying analytics to incidents to correlate and enrich incident data.

AIOps, in which artificial intelligence is embedded in IT operations, is an emerging trend. In incident management, machine learning helps teams move from reactive incident response to a proactive stance.

AIOps can perform the following functions in incident management:

  • Filter alerts so important issues stand out
  • Identify the right team to address the issue
  • Connect related alerts to minimize redundant work
  • Automate aspects of ticketing, issue categorization, and escalation
  • Correlate alerts and system changes to identify the root cause
  • Visualize topology and the timeline to expedite resolution

How to Improve Incident Management KPIs

Every incident management team is striving to improve its KPIs. But it can be hard to know what changes will have the most beneficial impact. These are some of the best ways to make improvements in incident management performance.

1. Reduce the Number of Incidents

Noise often overwhelms service teams, meaning they receive so many trouble signals that they struggle to identify the important ones. This noise impacts all incident management KPIs because high incident volume stresses resources and makes it hard for service engineers to quickly respond, diagnose, and resolve problems.

Implement rule-based alert silencing: Systems can generate alerts even when they are running normally. These alerts can result from maintenance, configuration changes, security updates, patches, and migrations. They convert to incidents, which require time to process and close. By relating changes to alerts, you can automate silencing them.
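One simple way to relate changes to alerts is to silence any alert raised on a host while a planned change window for that host is open. The sketch below assumes this rule; the `ChangeWindow` and `Alert` records, their fields, and the host names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeWindow:
    host: str
    start: datetime
    end: datetime


@dataclass
class Alert:
    host: str
    raised_at: datetime
    message: str


def is_silenced(alert, windows):
    """True if the alert was raised on a host during one of its change windows."""
    return any(
        w.host == alert.host and w.start <= alert.raised_at <= w.end
        for w in windows
    )


windows = [ChangeWindow("db-01", datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0))]
alerts = [
    Alert("db-01", datetime(2024, 1, 1, 2, 30), "service restarted"),  # during patching
    Alert("db-01", datetime(2024, 1, 1, 9, 0), "service restarted"),   # outside the window
    Alert("web-01", datetime(2024, 1, 1, 2, 30), "high latency"),      # different host
]
actionable = [a for a in alerts if not is_silenced(a, windows)]
print(len(actionable))  # 2
```

Matching on host alone is a deliberate simplification. A rule that silences too broadly can hide a real failure that happens to coincide with maintenance.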

Fine-tune alert parameters: A monitoring solution may have alert thresholds set too low or measure the wrong variable. This can trigger many alerts that are not actionable but still require incident management processing. Compare alert volume and the outcome (actionable/nonactionable) to identify poorly designed alerting.

Notify judiciously: Alerts about low-priority issues can come through push notifications or Slack messages. Reserve phone calls for major incidents.
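The comparison of alert volume against outcome (actionable or nonactionable) mentioned above can be sketched as a per-rule ratio. The rule names and outcome data here are hypothetical; in practice the outcome flag would come from how each resulting incident was closed.

```python
from collections import Counter


def actionable_rate_by_rule(outcomes):
    """outcomes: iterable of (rule_name, was_actionable) pairs."""
    total, actionable = Counter(), Counter()
    for rule, was_actionable in outcomes:
        total[rule] += 1
        actionable[rule] += int(was_actionable)
    return {rule: actionable[rule] / total[rule] for rule in total}


# Hypothetical history: 10 CPU alerts (1 actionable), 5 disk alerts (4 actionable).
history = (
    [("cpu_high", True)] + [("cpu_high", False)] * 9
    + [("disk_full", True)] * 4 + [("disk_full", False)]
)
rates = actionable_rate_by_rule(history)
print(rates)  # {'cpu_high': 0.1, 'disk_full': 0.8}
```

A rule with high volume and a low actionable rate, like the made-up `cpu_high` rule here, is a candidate for a higher threshold or a different measured variable.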

Correlate alerts: An alert about an incident cause may occur at the same time as a symptomatic or redundant alert. The operator does not know they’re related, and the team runs multiple incidents about the same issue. By correlating alerts, the team can group all alerts from one incident together and eliminate duplication.
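Real correlation engines draw on topology and change data, but the basic grouping idea can be illustrated with a much simpler rule: alerts for the same service that arrive within a short time of each other are treated as one incident. The tuple layout, the `correlate` function, and the five-minute window are all assumptions made for this sketch.

```python
def correlate(alerts, window_seconds=300):
    """Group (timestamp_seconds, service, message) alerts into candidate incidents.

    An alert joins the latest group for its service if it arrives within
    window_seconds of that group's most recent alert; otherwise it starts a new group.
    """
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts):
        timestamp, service, _message = alert
        group = latest_group.get(service)
        if group and timestamp - group[-1][0] <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[service] = group
    return groups


alerts = [
    (0, "db", "disk latency high"),
    (60, "db", "query timeouts"),
    (90, "web", "5xx errors"),
    (1000, "db", "disk latency high"),
]
groups = correlate(alerts)
print(len(alerts), "alerts grouped into", len(groups), "incidents")  # 4 alerts grouped into 3 incidents
```

This version would not link the `web` errors to the `db` problem even if one caused the other; tying symptoms to a cause across services is where dependency information becomes necessary.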

2. Improve Incident Response Time

Automate remediation of low-level issues: Free engineers to focus on high-value problems by building automated workflows to resolve routine incidents.

Clarify the escalation policy: Verify that everyone knows how to escalate a problem when appropriate. The policy should spell out whom to call, back-up responders, and when to go higher on the hierarchy.
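An escalation policy is easier to verify when it is written down as data rather than tribal knowledge. The following sketch is one hypothetical representation; the contacts, the timing thresholds, and the `who_to_notify` helper are illustrative, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step applies
    contact: str


# Hypothetical policy: whom to call, the back-up responder, and when to go higher.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "back-up responder"),
    EscalationStep(30, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged, policy=POLICY):
    """Contacts who should have been notified after the given wait."""
    return [s.contact for s in policy if s.after_minutes <= minutes_unacknowledged]


print(who_to_notify(20))  # ['primary on-call', 'back-up responder']
```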

Use AIOps: AI-powered solutions can detect system anomalies before they affect users, giving incident management teams a head-start on resolving issues.

3. Strengthen Diagnostic Capability

Don’t forget about problem management: The goal of incident management is to end an event and return to normal operation as quickly as possible. Problems cause incidents and require further investigation once the incident is resolved. In ITIL, incident and problem management are separate roles, while DevOps tends to combine them. In either approach, do not neglect further investigation that can produce intelligence to prevent incidents.

Codify knowledge: Do post-mortems or reviews after major incidents, so you understand the event and its causes. Document everything and create a runbook that tells an incident responder how to solve the issue. The user, who might be a junior person on-call, can refer to procedures in the runbook and potentially resolve an issue alone. Write action plans for every scenario you encounter.

Use your superpowers: Leverage cutting-edge technology. The complexity and fast-changing nature of the IT infrastructure in many organizations make incident diagnosis challenging. An AI-driven topology tool ingests data from many sources and provides a real-time model that enables the team to have visibility into the entire stack. Similarly, event correlation tools make it possible to pinpoint the last good configuration, so you have a target for restoration.

4. Embrace Continuous Improvement

Iterate change: Try process improvements and give them a chance to appear in your KPIs. If MTTR falls, keep the change in place, and experiment again. The key is having data to measure the impact of the changes.

Collect incident outcomes: Gather data on every incident to identify areas for improvement. Ask yes-no questions such as:

  • Was this incident closed without action?
  • Was the incident triggered by a planned or unplanned change or maintenance that completed successfully?
  • Did the alerts in this incident indicate an actual problem with a device, host, component, application, or another service component?
  • Was this incident closed and the actual problem fixed in another incident?
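Because the answers are yes or no, they are straightforward to tally. In this sketch, the four field names are hypothetical labels for the questions above, and the records are made up.

```python
from collections import Counter

# Hypothetical field names, one per yes-no question above.
QUESTIONS = (
    "closed_without_action",
    "triggered_by_successful_change",
    "actual_problem",
    "fixed_in_another_incident",
)


def outcome_rates(records):
    """Fraction of incidents answering 'yes' to each question."""
    counts = Counter()
    for record in records:
        for question in QUESTIONS:
            counts[question] += bool(record.get(question, False))
    return {q: counts[q] / len(records) for q in QUESTIONS}


records = [
    {"closed_without_action": True, "triggered_by_successful_change": True},
    {"actual_problem": True},
    {"actual_problem": True, "fixed_in_another_incident": True},
    {"closed_without_action": True},
]
rates = outcome_rates(records)
print(rates["closed_without_action"])  # 0.5
```

A high share of incidents closed without action, or triggered by changes that completed successfully, points back to the alert silencing and tuning steps in the first improvement area.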