What is root cause analysis (RCA)?
Root cause analysis (RCA) is a systematic approach to defining symptoms, identifying contributing factors, and repairing faults when problems arise. The process can be applied to virtually any problem in any industry, from NASA’s Apollo 13 mission to everyday tech problems that happen within modern IT departments.
With all the potential use cases in mind, it’s worth noting that there is more than one way to apply RCA—and it’s important that IT teams use a root cause analysis process that’s best suited to their company and systems.
- Types of root causes
- Root cause analysis process
- Root cause analysis examples
- Benefits of root cause analysis
- Streamline root cause analysis with BigPanda
Types of root causes
A root cause can be one of two things:
- The absence of a best practice that would have prevented the problem.
- A failure to apply knowledge that would have prevented the problem.
With that in mind, there are many ways to classify the root cause you discover. Some of the most common classifications of root causes are as follows.
Human error
Human error is often characterized as a failure to apply knowledge that would have prevented the problem you’re analyzing, and it’s a very common type of root cause. For example, IT root cause analysis may reveal that a new user lacked necessary training or that a non-tech employee simply didn’t have enough knowledge of a given system.
Many root cause analysis experts consider human error to be a causal factor, not a type of root cause, since further analyzing what led to the human error (e.g., lack of training) gives a stronger foundation for identifying the true root cause of an incident. For instance, many cases of human error are linked to an organizational cause.
Organizational causes
Organizational causes often come down to insufficient or faulty systems, policies, or processes. For instance, if a company gives a user incomplete instructions, that can lead to human error. Likewise, if an organization’s policies do not detail proper security protocols for personal devices, that can result in a variety of incidents, including a potential breach.
There are also many causal factors that can be traced back to organizational causes, such as insufficient testing of hardware or software, incorrect coding, or incorrect design. While these causal factors might be attributed to human error at first thought, they’re ultimately due to the organization’s lack of policies and protocols to define requirements and ensure they are properly adhered to.
Physical causes
Physical causes of IT problems can come in various forms. For instance, a server could be damaged or destroyed due to improper ventilation in the server closet.
System failure can also happen due to outright hardware failure, in which case the organization should contact the developers or manufacturers behind the system, work with them to assess the cause, and get a replacement.
Root cause analysis process
There is no one-size-fits-all process when it comes to conducting a root cause analysis. The most important thing is that the team you put in charge of the process is able to effectively apply a methodology and take the time to properly review all the nuances that factored into the event.
It’s always helpful to have someone on the team who has experience with conducting an RCA at the same scale you are dealing with. However, if that’s not possible, you can ensure a successful RCA by properly comparing methodologies and techniques, and ensuring you are applying all the latest best practices.
Once you identify an appropriate solution, it’s important to ensure you are assessing, implementing, and reviewing the results properly. Here is an overview of the five steps you should follow to conduct an RCA.
1. Review the best practices
It’s important for all members of an RCA team to review the latest best practices for conducting a root cause analysis before delving into the process. For example:
Causal factors can always be traced to root causes
Your root cause analysis report is not finished if you have identified a root cause like “human error.” As discussed above, human error is actually classified as a causal factor. If you leave your investigation there, you have failed to identify the actual root cause that led to that human error.
Most adverse IT events will have a number of causal factors involved, and it’s part of the investigation process to trace them back to the actual root cause. If you fail to do so, you will not be able to effectively prevent reoccurrence. A causal factor tree analysis can help you plot causal factors and trace them back (more on that below).
Root cause analysis doesn’t mean placing blame
The purpose of a root cause analysis is to get to the origin of a problem and figure out how that cause can be eliminated to prevent future problems. The process can sometimes result in a team placing blame on a certain department or individual, but that is not the purpose of the exercise—and that should be made clear from the beginning.
Root cause analysis should never lead to a team or individual being reprimanded. Instead, if someone’s work was found to be the root cause—like an improperly enforced protocol—they should be looped into the conversation, as they can likely provide additional insight on what happened and how it can be prevented in the future.
As an organization, it’s important to create a culture where incidents are framed as learning opportunities. If people believe that someone will be found to blame because of the analysis, data collection could be compromised and events could be misconstrued, making a proper RCA practically impossible.
Groupthink can constrain the investigation
Teamwork and collaboration are often encouraged as part of a root cause investigation; however, it’s important to avoid groupthink. While brainstorming sessions can be helpful, empower each member of the investigative team to think independently, or else you may stifle creative analyses and perspectives.
With this in mind, it’s also important to bring in a diverse set of members for the investigation. Choose people from different teams or groups throughout your organization, and make sure you have employees from different functional areas in order to promote effective collaboration where each individual is able to contribute something unique.
2. Choose an analysis method
There are multiple ways to conduct a root cause analysis, and you may end up combining more than one process. Starting with the most common method, here is an introduction to some of the most widely used RCA methods for the IT industry.
Causal factor tree analysis
A causal factor is defined as an event or condition—such as a user failing to power on a machine—that contributes to an adverse event. Since there can be countless causal factors involved in an incident, a causal factor tree analysis can help you trace them to a root cause. This technique is used to investigate a single event, and that event goes at the top of the tree.
Immediate causes of the event are displayed below it on the tree, with branches linking them together. Next, the immediate causes for each of those factors are added on the line below, with more branches connecting them. This mapping process continues, with each level containing more and more factors. The cause-and-effect chain can then be visualized from bottom to top, allowing for complete visibility into the factors that led to the event.
Many times, when using this method, you’ll realize that the immediate causes of a factor are not yet known, and discovering them is part of the investigation process. As such, this technique is able to expose knowledge gaps. However, this method doesn’t provide any tool for closing those gaps, which is why combining it with change analysis or barrier analysis can be useful for answering those unknowns.
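As a rough illustration, a causal factor tree can be modeled as a small nested data structure. The sketch below is one possible way to do that, not a standard tool: the `Factor` class, the `known` flag, and the sample factors are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Factor:
    """One node in a causal factor tree (hypothetical structure)."""
    description: str
    known: bool = True  # False marks a knowledge gap still to investigate
    causes: List["Factor"] = field(default_factory=list)


def leaf_factors(node: Factor) -> List[Factor]:
    """Return the deepest factor on every branch of the tree."""
    if not node.causes:
        return [node]
    leaves: List[Factor] = []
    for cause in node.causes:
        leaves.extend(leaf_factors(cause))
    return leaves


# The adverse event sits at the top; immediate causes hang below it.
event = Factor("Department cannot access its dashboard", causes=[
    Factor("Users locked out of the system", causes=[
        Factor("New firewall blocks legitimate traffic", causes=[
            Factor("Cause of the overly aggressive rules", known=False),
        ]),
    ]),
    Factor("No alert reached the on-call engineer", causes=[
        Factor("Alert routing policy was never documented"),
    ]),
])

leaves = leaf_factors(event)
gaps = [f.description for f in leaves if not f.known]
candidates = [f.description for f in leaves if f.known]
```

Leaves marked `known=False` are the knowledge gaps the investigation still has to close, while the remaining leaves are candidate root causes to validate.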
Change analysis
This analysis technique focuses on a given problem or incident, seeking to pinpoint how change (specifically, deviating from a policy or procedure) led to an unfavorable outcome. It’s easy to apply change analysis to just about any IT problem, and it also results in a clear, concrete next step to prevent the incident from happening again.
However, there are instances when change analysis might not be the best technique for your team. Mainly, if your organization does not have clearly defined procedures or policies to begin with that relate to the incident at hand, you won’t have enough information to properly apply this technique. On the other hand, if you have many variables in your processes, this technique might simply prove too intensive.
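One way to picture change analysis is as a comparison between the documented procedure and what was actually done. The sketch below assumes both can be written down as lists of step names; the `find_deviations` function and the sample steps are hypothetical.

```python
def find_deviations(documented, performed):
    """Compare a documented procedure with the steps actually performed."""
    skipped = [step for step in documented if step not in performed]
    unplanned = [step for step in performed if step not in documented]
    return {"skipped": skipped, "unplanned": unplanned}


# Illustrative data only: a made-up firewall change procedure.
documented_procedure = [
    "open change ticket",
    "test rules in staging",
    "deploy firewall rules",
    "verify dashboard access",
]
performed_steps = [
    "open change ticket",
    "deploy firewall rules",
    "hotfix rule on production",
]

deviations = find_deviations(documented_procedure, performed_steps)
```

Each skipped or unplanned step is a candidate explanation for the incident. Note that this only works when a documented procedure exists to compare against, which is the limitation described above.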
Barrier analysis
A barrier analysis allows your team to identify physical, procedural, or administrative failures that led to an incident. This analysis technique is helpful for identifying why the barriers you have in place (e.g., IT policies) failed to prevent the adverse event from happening and what can be improved upon to ensure that failure doesn’t happen again.
This technique starts with identifying all the barriers that were in place before the incident, and then each one must be reviewed to determine if it was functioning at that time. For instance, if the server room is supposed to be kept at 70 degrees, you should determine if the climate control shut off unexpectedly, if the thermostat failed to register the temperature, or if the alert system didn’t send the notification it was supposed to.
Barrier analysis will allow your team to figure out whether each barrier deviated from standard operation and, if it did not, whether it helped decrease the total severity of the event. For instance, even if the climate control failed, a thermostat that notified your team as it was supposed to helped prevent major damage.
By the end of a barrier analysis, your team can determine: if existing barriers are helpful; if they were appropriately maintained and inspected prior to the event; if they need to be made stronger; or if additional barriers need to be put into place, and how. As you can imagine, the efficacy of a barrier analysis depends on what you are investigating.
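The server room example can be captured in a short sketch. The `Barrier` class, its fields, and the sample entries are assumptions for illustration; a real review would record whatever evidence your team collects for each barrier.

```python
from dataclasses import dataclass


@dataclass
class Barrier:
    """A safeguard that was in place before the incident (hypothetical model)."""
    name: str
    kind: str               # "physical", "procedural", or "administrative"
    functioned: bool        # did it work as designed during the incident?
    reduced_severity: bool  # did it limit the damage?


def review_barriers(barriers):
    """Split barriers into those that failed and those that limited the damage."""
    failed = [b.name for b in barriers if not b.functioned]
    helped = [b.name for b in barriers if b.functioned and b.reduced_severity]
    return failed, helped


barriers = [
    Barrier("climate control", "physical", functioned=False, reduced_severity=False),
    Barrier("thermostat alert", "physical", functioned=True, reduced_severity=True),
    Barrier("server room inspection", "procedural", functioned=True, reduced_severity=False),
]

failed, helped = review_barriers(barriers)
```

Barriers in `failed` need repair or strengthening, those in `helped` are worth keeping, and anything in neither list may point to a barrier that adds little protection for this kind of event.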
Risk tree analysis
Event trees were developed in 1974 when the U.S. government was conducting the Reactor Safety Study (WASH-1400), because a fault tree analysis was simply too large and unruly for such a complex and serious circumstance. The event trees they developed allowed them to identify what leads to the most significant risk of failure without mapping every single path out in a fault tree.
Since then, risk tree analysis has been favored across industries, including by IT professionals, for its efficiency and ability to provide greater visibility into a problem. For instance, with multiple layers of detail in front of you, your team can easily identify coexisting contributors that may be entirely unrelated. However, given that there is so much detail, it can become easier to overlook subtle differences.
It’s also worth noting that risk tree analysis is a more advanced technique, and the person or people conducting the analysis should have some experience with this method to make sure it is conducted efficiently and properly.
Kepner-Tregoe method
The Kepner-Tregoe (KT) method was created in the 1960s, but it became famous when NASA applied it to the Apollo 13 mission to bring the team back home. The KT method involves gathering and prioritizing information in order to assess risk and come to a solution in the most efficient manner.
The very first step in the KT method is to identify all the problems your organization is dealing with and classify them by how concerning they are. From there, you will determine the priority by assessing the urgency and potential for impact (and continued growth) of each problem.
Once problems are prioritized, the KT method makes it easy to decide what the next best action is, who needs to be involved in the solution, and what their role will be. Like the problems you’re solving, objectives should be weighed as well, with your team categorizing each planned action as necessary or potentially skippable.
The weighted system used by this method is what makes it so powerful, as it gives your team a direct route to the minimum viable solution for the highest priority problem.
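The prioritization step can be approximated with a simple weighted score. This is only a sketch of the idea: the 1-to-3 scale, the equal default weights, and the sample problems are assumptions, not an official KT worksheet.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    urgency: int  # assumed scale: 1 (low) to 3 (high)
    impact: int   # assumed scale: 1 (low) to 3 (high)
    growth: int   # assumed scale: 1 (stable) to 3 (worsening quickly)


def priority(problem, weights=(1, 1, 1)):
    """Weighted sum of urgency, impact, and growth; higher means more pressing."""
    w_urgency, w_impact, w_growth = weights
    return (w_urgency * problem.urgency
            + w_impact * problem.impact
            + w_growth * problem.growth)


problems = [
    Problem("Dashboard outage for one department", urgency=3, impact=2, growth=1),
    Problem("Server room running hot", urgency=2, impact=3, growth=3),
    Problem("Outdated onboarding guide", urgency=1, impact=1, growth=1),
]

ranked = sorted(problems, key=priority, reverse=True)
top_problem = ranked[0].name
```

Adjusting `weights` lets a team express, for example, that potential impact matters more than urgency in their environment.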
3. Determine which tools to use
There are countless tools and techniques that you can combine with the above methodologies as necessary. For instance, the “Five Whys” method is often used with a causal factor tree analysis to map out the different levels that led from one factor to another. Here’s a closer look at that technique and some other common tools you can apply.
The Five Whys
The “Five Whys” technique is just what it seems. This iterative, interrogative technique helps you delve into the cause-and-effect relationship behind the factors that ultimately led to the problem you’re investigating. The first question you ask tends to be very simple, but once you answer it, the next “Why?” gets more specific.
Continue this process and you’ll rapidly learn that there are many more contributing factors than first assessed. For instance, if the problem is that a department can’t access its dashboard, the first factor you identify might be that those users were locked out of your system, but the “Five Whys” reveal that this happened because a new firewall was recently installed and it’s too aggressive, causing it to block legitimate traffic.
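The dashboard example can be written as a chain of question-and-answer pairs. In the sketch below, the `ask_why` helper and the exact wording of each answer are assumptions; the point is only to show how each answer becomes the subject of the next "Why?".

```python
def ask_why(problem, answers, max_depth=5):
    """Follow recorded answers to 'Why?' until none is left or max_depth is hit."""
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        if current not in answers:
            break  # no recorded answer: a gap the investigation must fill
        current = answers[current]
        chain.append(current)
    return chain


# Each key is a finding; each value is the answer to "Why did that happen?"
answers = {
    "Department cannot access its dashboard":
        "Users were locked out of the system",
    "Users were locked out of the system":
        "The firewall blocked their traffic",
    "The firewall blocked their traffic":
        "A new firewall was installed with overly aggressive rules",
}

chain = ask_why("Department cannot access its dashboard", answers)
deepest_cause = chain[-1]
```

If the chain stops before five questions, as it does here, that usually means the next "Why?" has not been answered yet, for instance why the firewall rules were deployed without testing.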
Pareto chart
A Pareto chart is a type of bar graph where the length of each bar represents either cost (in time or dollars) or frequency. The longest bars on a Pareto chart are always at the left and the shortest are at the right, allowing your organization to visually see which situation or problem is the most impactful.
In the context of a root cause analysis, a Pareto chart can help you easily analyze a lot of data all at once, allowing you to properly prioritize the problems or improvements needing to be addressed first. You can apply it early on in the RCA process if you are dealing with multiple problems that may have different root causes. You can also use a Pareto chart after conducting your investigation in order to assess which suggestions should be pursued first.
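The data behind a Pareto chart is just a sorted frequency table, often shown with a running cumulative percentage. The sketch below computes that table with the standard library; the `pareto_table` function and the incident categories are made up for illustration.

```python
from collections import Counter


def pareto_table(categories):
    """Return (category, count, cumulative %) rows, most frequent first."""
    counts = Counter(categories).most_common()
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for name, count in counts:
        running += count
        rows.append((name, count, round(100 * running / total, 1)))
    return rows


# Illustrative data: one entry per incident, labeled with its category.
incident_causes = (["config change"] * 5
                   + ["expired certificate"] * 3
                   + ["disk full"] * 2)

table = pareto_table(incident_causes)
```

Plotting the counts as bars from left to right, in the order returned, gives the chart described above.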
Fishbone diagram
A fishbone diagram allows you to plot cause and effect. It may look like a tree diagram rotated 45 degrees, but the purposes of these two tools are very different. While a tree diagram is designed to help you eliminate and narrow down possible causes as you move up the tree, a fishbone diagram helps you dive deeper into the causes by sorting them into categories.
These charts are often referred to as cause-and-effect diagrams, and they are particularly valuable when the potential sources of a problem might come from many different areas. These areas could include instances when you have little to no knowledge about what the root cause of a problem could be.
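At its core, a fishbone diagram groups possible causes under category "bones" attached to a single effect. A minimal sketch of that grouping follows; the `build_fishbone` function, the category names, and the sample causes are assumptions chosen to match the examples in this article.

```python
from collections import defaultdict


def build_fishbone(effect, causes):
    """Group (category, cause) pairs under the effect being investigated."""
    bones = defaultdict(list)
    for category, cause in causes:
        bones[category].append(cause)
    return {"effect": effect, "bones": dict(bones)}


possible_causes = [
    ("training", "New user never completed onboarding"),
    ("policies", "No security protocol for personal devices"),
    ("hardware", "Poor ventilation in the server closet"),
    ("training", "Non-tech staff unfamiliar with the system"),
]

diagram = build_fishbone("Recurring service outages", possible_causes)
```

Categories that collect many causes are natural places to dig deeper first.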
Scatter plot
Scatter plots visually represent how two sets of data relate to one another. In the context of a root cause analysis, you would plot the suspected root cause on the x-axis and the resulting problem on the y-axis. If you start to see a clear pattern as you continue to plot more of both onto the diagram, it’s likely that the cause and problem are correlated.
Of course, correlation does not equal causation. So, just because you’re able to identify a strong correlation, you haven’t necessarily determined the root cause of the problem. Still, a scatter diagram can help you quickly determine if there is a correlation, giving you the confidence to look more closely at those variables.
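If you want a number to go with the visual pattern, one common choice is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the choice of Pearson, the `pearson_r` function, and the sample data are assumptions for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or 2 > n:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Illustrative data: suspected cause (x) and resulting problem (y) per day.
deployments_per_day = [1, 2, 3, 4, 5]
incidents_per_day = [1, 3, 2, 5, 4]

r = pearson_r(deployments_per_day, incidents_per_day)
```

A value near 1 or -1 suggests the two variables move together, but, as noted above, it does not prove that one causes the other.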
Failure mode and effect analysis (FMEA)
A failure mode and effect analysis (FMEA) might be utilized at any stage of the root cause analysis process, as it can help you identify and explore the potential points of failure for your existing processes, policies, and procedures. An FMEA can also tell you the potential impact of a failure, which can help you assess priority.
While very valuable to the RCA process, an FMEA requires you to assemble a team of cross-functional stakeholders who are familiar with whatever you are analyzing, which can make it somewhat resource intensive. With that said, it’s worth noting that you can conduct an FMEA at any time to help identify potential root causes long before any incidents occur.
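A common convention in FMEA worksheets is to score each failure mode for severity, occurrence, and detection, then multiply the three into a risk priority number (RPN). The sketch below assumes 1-to-10 scales, where a higher detection score means the failure is harder to detect; the `FailureMode` class and the sample scores are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int    # assumed scale: 1 (negligible) to 10 (critical)
    occurrence: int  # assumed scale: 1 (rare) to 10 (frequent)
    detection: int   # assumed scale: 1 (always caught) to 10 (never caught)

    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        scores = (self.severity, self.occurrence, self.detection)
        if any(score not in range(1, 11) for score in scores):
            raise ValueError("scores must be integers from 1 to 10")
        return self.severity * self.occurrence * self.detection


failure_modes = [
    FailureMode("Climate control shuts off", severity=9, occurrence=3, detection=4),
    FailureMode("Firewall blocks legitimate traffic", severity=6, occurrence=5, detection=2),
    FailureMode("New user lacks training", severity=4, occurrence=6, detection=7),
]

by_priority = sorted(failure_modes, key=FailureMode.rpn, reverse=True)
```

The failure modes with the highest RPN are the ones to address first, though teams often review any item with a very high severity score regardless of its RPN.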
4. Identify an appropriate solution
No matter what type of root cause analysis you choose to apply to an incident, it should always result in a clear path forward with a set of improvements and suggestions that will prevent the problem from reoccurring. However, it’s essential that you test your assumptions before going about implementing them.
In the worst-case scenario, you may find that the proposed solution doesn’t actually address the root cause, in which case you may need to repeat your analysis and determine whether you have identified the true root cause and all the contributing factors involved. While that’s a disappointing result, it’s far better than implementing a laundry list of improvements and finding that the problem still reoccurs.
Once your team is certain that an appropriate solution has been identified, it is important to map a strategy for implementation:
- Prioritize the suggested improvements based on the potential impact of failures that could occur until they are in place, along with the resources required to execute the improvements.
- Identify who is responsible for each improvement and set a reasonable timeline for completing it. Ensure that each owner has someone assigned to report to as they implement the suggestion.
- Include the implementation process in your documentation for the root cause investigation, taking note of any additional findings or adjustments made to the solution.
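The strategy above can be kept honest with a small checklist structure. In this sketch, the `Improvement` class, the 1-to-5 scales, and the sample items are hypothetical; it orders improvements by impact and effort and flags any that still lack an owner, a reviewer, or a due date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Improvement:
    title: str
    impact: int                     # assumed: 1 (minor) to 5 (severe) if left unfixed
    effort: int                     # assumed: 1 (cheap) to 5 (expensive)
    owner: Optional[str] = None
    reviewer: Optional[str] = None  # who the owner reports progress to
    due: Optional[date] = None


def build_plan(items: List[Improvement]) -> Tuple[List[Improvement], List[str]]:
    """Order by highest impact, then lowest effort; list items missing details."""
    ordered = sorted(items, key=lambda i: (-i.impact, i.effort))
    incomplete = [i.title for i in ordered
                  if not (i.owner and i.reviewer and i.due)]
    return ordered, incomplete


items = [
    Improvement("Document firewall change procedure", impact=3, effort=1,
                owner="network lead", reviewer="IT manager", due=date(2024, 6, 1)),
    Improvement("Add staging test for firewall rules", impact=5, effort=3,
                owner="network lead"),
    Improvement("Train new users on the dashboard", impact=3, effort=2,
                owner="support lead", reviewer="IT manager", due=date(2024, 7, 1)),
]

plan, incomplete = build_plan(items)
```

Anything listed in `incomplete` should be resolved before the plan is added to the investigation’s documentation.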
Depending on the size of your organization and severity of the root cause, it could take weeks to months to fully implement all suggestions that resulted from the root cause analysis. However, it’s important not to skip the final step—follow up.
5. Follow up and review
Following up on the improvements weeks or months after they were implemented will allow your team to accomplish a few key tasks, such as:
- Assessing the outcome of the improvements and whether they have effectively prevented the incident from reoccurring.
- Tallying the total cost and resources invested in the improvements, allowing for a proper assessment of the risk-reward equation.
- Identifying additional improvements that could be made following a thorough review, and discussing any gaps that were identified in the original solution.
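The first of these checks, whether the incident has reoccurred, can be as simple as filtering your incident log. The sketch below assumes each incident record carries a date and a root cause label; the `recurrences` function, the field names, and the sample log are made up for illustration.

```python
from datetime import date


def recurrences(incidents, root_cause, fixed_on):
    """Incidents with the same root cause logged after the fix went live."""
    return [i for i in incidents
            if i["root_cause"] == root_cause and i["date"] > fixed_on]


# Illustrative incident log; dates and labels are invented.
incident_log = [
    {"date": date(2024, 1, 10), "root_cause": "firewall rules"},
    {"date": date(2024, 2, 2), "root_cause": "firewall rules"},
    {"date": date(2024, 3, 15), "root_cause": "expired certificate"},
    {"date": date(2024, 4, 1), "root_cause": "firewall rules"},
]

repeat_incidents = recurrences(incident_log, "firewall rules",
                               fixed_on=date(2024, 2, 15))
fix_held = not repeat_incidents
```

Here one incident with the same root cause appears after the fix date, so `fix_held` is `False` and the original solution deserves another look.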
Once the case is closed, it’s also worth considering conducting an FMEA to review new or revised policies and procedures and prevent more incidents from occurring where possible. Conducting these reviews on a routine basis when major changes are made within the organization will likely prove worthwhile in the long run.
Especially for IT departments, setting up a root cause identification tool can also help you resolve future incidents faster. BigPanda’s console highlights Root Cause Changes, which is a great example of how machine learning can help IT teams avoid long and painful outages by giving them insights into real-time system changes and status.
Root cause analysis examples
There is no shortage of incidents in modern IT organizations, whether you’re running on legacy systems with roundabout integrations or a complex hybrid setup that sees thousands of changes every day.
There is always room for error, and it’s the goal of your team to help eliminate incidents and outages by taking as many precautions as possible. Of course, hardware-related IT issues are generally simpler to analyze and solve than software and infrastructure-related problems.
When something goes wrong with your services or software, it takes more than plotting variables on a fishbone diagram to figure out what went wrong. With so many moving parts, your team has to get technology involved.
Automation and machine learning are two of the components that can help you get to the true root cause by ingesting massive amounts of data from all major change management, CI/CD, and orchestration tools. In turn, your team can quickly review changes and correlate them to incidents in real time, saving countless hours spent guessing, testing, and going back to square one.
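To make the idea of correlating changes to incidents concrete, here is a deliberately naive sketch that flags changes deployed shortly before an incident started. It is not how BigPanda or any particular product works; the `suspect_changes` function, the two-hour window, the field names, and the sample change records are all assumptions.

```python
from datetime import datetime, timedelta


def suspect_changes(incident_start, changes, window=timedelta(hours=2)):
    """Changes deployed within `window` before the incident, most recent first."""
    suspects = [
        change for change in changes
        if window >= incident_start - change["deployed_at"] >= timedelta(0)
    ]
    return sorted(suspects, key=lambda c: c["deployed_at"], reverse=True)


# Illustrative change records, as they might come from a CI/CD tool.
changes = [
    {"id": "CHG-1", "deployed_at": datetime(2024, 5, 1, 8, 0)},
    {"id": "CHG-2", "deployed_at": datetime(2024, 5, 1, 13, 30)},
    {"id": "CHG-3", "deployed_at": datetime(2024, 5, 1, 14, 45)},
    {"id": "CHG-4", "deployed_at": datetime(2024, 5, 1, 15, 30)},
]
incident_start = datetime(2024, 5, 1, 15, 0)

suspects = suspect_changes(incident_start, changes)
```

A time window alone produces plenty of false positives in an environment with thousands of daily changes, which is why real tooling also weighs factors such as which services a change touched.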
If you’re interested in exploring more root cause analysis examples and how technology can simplify the solutions, read a quick blog post on BigPanda’s root cause identification tool for ITOps.
Benefits of root cause analysis
IT teams can certainly function without utilizing root cause analysis, but taking the time to investigate symptoms ultimately saves resources. Instead of mitigating symptoms as they come, RCA requires teams to take a closer look at what’s causing that anomalous or unwanted behavior. As a result, RCA can reveal underlying issues with policies, processes, hardware, software, or training and identify a solution.
Not only will applying RCA help prevent the same symptoms from reoccurring in the future, but it can also uncover an underlying root cause that may have played a role in other incidents or could have led to a much bigger problem down the road if it wasn’t addressed.
For instance, if a piece of equipment was inconveniently shutting down, RCA might reveal that it was faulty, allowing the company to replace it while under warranty, preventing a major expense and complete failure at a later date. Preserving resources and creating more robust systems are just two benefits of the root cause analysis process.
Some other benefits of IT RCA include:
- Improving workflows
- Implementing more effective policies
- Streamlining internal processes
- Properly documenting protocols
- Permanently eliminating known symptoms
Streamline root cause analysis with BigPanda
Root cause analysis isn’t optional. When things go wrong with your systems, you need to know exactly what led to the problem, and quickly, so that you can restore operations and ensure outages and incidents don’t happen again.
The thing is, root cause analysis in the IT industry has grown increasingly complex. With hybrid systems and hundreds of integrations, today’s companies can’t simply gather stakeholders around a table to discuss what happened. In order to stay on top of your ITOps in real time, you need technology capable of monitoring your tech stack and instantly notifying you when things don’t go to plan.
BigPanda assists root cause analysis in four key ways:
- Our event timelines bundle related events across multiple sources together into a single view, making it easy to see which events fired first. This allows the operator to identify the earliest failing resources in any incident they investigate.
- Our topology mesh visualizes the relationships between impacted resources in any incident. This depiction updates in real time as new events arrive, so the operator clearly sees failure points and blast radius.
- Our incident-dynamic titles update automatically as new events arrive, so the resources most impacted are clearly identifiable at a glance.
- Our Root Cause Changes capability ingests recent changes from your CI/CD tools and automatically correlates ongoing incidents with suspicious changes most likely to have caused them.
Are you interested in learning about how BigPanda uses AIOps to keep all the moving parts working correctly for today’s IT teams? Request a demo today to learn more.