Last updated on: July 26, 2024
Imagine you're working in IT and you're facing an incident. You might start by treating the issues that you see on the surface. For example, if your website is down, you might try to resolve the incident by restarting the server. However, if you don't address the root cause of the incident, it's likely to happen again. In this case, restarting the server is only an interim solution: it restores service but leaves the underlying fault in place. To find a permanent solution, it's important to analyze the root cause of an incident. Root cause analysis (RCA) can help teams do this by asking questions like, "Why did this happen?", finding the underlying issues, and fixing them so the incident doesn't occur again.
In this article, we will go over how you can get started with RCA, the steps involved, and the types of RCA to help you find the primary cause of any problem.
Root cause analysis (RCA) is a systematic approach that drills deep to identify the root cause of an incident by repeatedly asking "why" questions until no additional diagnostic responses can be provided. It typically involves analysis or a discussion soon after an incident has been resolved.
The main benefit of RCA is that it finds fundamental errors, enabling teams to find the right measures to fix problems and stop them from recurring. By using a variety of methods, RCA can help uncover clues that might otherwise be overlooked during the incident resolution process. This can lead to the identification of the exact cause of the incident, which can then be used to prevent similar incidents from happening in the future.
An open-source code repository company
The company experienced a major online service outage, which was caused by the accidental removal of data from the primary database server.
The incident led to hours of unavailability of repository services to users. To prevent future occurrences, the company made several operational and recovery procedure enhancements. Initially, the company relied on a single primary database and a secondary database in standby mode, with the secondary serving as a failover backup. However, this setup placed excessive load on a single database.
An engineer set up several dedicated storage servers in a staging environment to help balance the incoming load. Before starting the work, the engineer took a snapshot of the production database and loaded it into the company's staging environment. While attempting to restore processes to normalcy, the engineer wiped out the primary PostgreSQL database, mistaking it for the secondary database. By the time the mistake was realized and the database reverted to a previous state, approximately 300GB of data had already been wiped out.
To recover the repository, the recovery team had to use the LVM (Logical Volume Manager) snapshot from six hours before the outage. Once the repository was up and running, the team proceeded to use the five whys method to perform RCA.
The incident was further broken down into two major problems:
1. Service was down for 18 hours:
The questions asked were:
- Why was the repository down?
- Why was the database directory removed?
- Why did the replication stop?
- Why did the database load increase?
And so on.
2. Restoring the service took over 18 hours:
The questions asked were:
- Why did restoring take so long?
- Why was the staging database needed for restoring?
- Why did the team not use the standard backup procedure?
- Why was the backup procedure not tested on a regular basis?
And more.
This drill-down study helped the organization identify the gaps in its recovery procedures, including disaster recovery, and improve them. The incident also led to the creation of a monitoring dashboard to speed up future resolution times.
The above case highlights the role RCA played in the company's efforts to reduce downtime in the future and make its operations and services more efficient. Now, let's see how your organization can perform RCA and what the steps are.
An RCA map may look slightly different across organizations and industries, but here are the five most common steps to perform RCA:
1. Define the problem:
When an incident occurs, your first move is to contain or isolate the affected areas. Once service is restored, the incident itself is resolved. The problem, however, remains open: to keep the incident from ever occurring again, a deep dive into the reasons for the occurrence is necessary. This is where the RCA process begins and where defining the problem is crucial. Defining the problem requires you to know the problem that is being solved, the effect it has caused, the time and date of the occurrence, and so on.
2. Gather data:
Once you have defined the problem, compile all available data and evidence related to the specific incident to begin understanding the underlying cause. It's also important to take into account the first-hand experience and evidence from the people involved in the incident or in previous, similar incidents.
3. Determine the root cause:
Here's where the core analysis happens. You can use a variety of RCA techniques, and each technique helps you look for small clues that may reveal the root cause.
4. Implement the solution:
Determining the root cause will indicate one or more solutions. You might be able to implement them right away, or the solution might require some additional work. Either way, RCA isn't done until you’ve implemented a solution or a work-around, depending on the feasibility.
5. Document actions taken:
After you've identified and performed corrective actions, document the problem and the overall resolution so that future employees can use it as a resource or reference.
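As a rough illustration, the five steps above can be captured in a single record that travels with the incident. The sketch below is only one possible shape: the `RcaRecord` class, its field names, and the sample values are all hypothetical and not taken from any particular service desk tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class RcaRecord:
    """Hypothetical record mirroring the five RCA steps."""
    problem: str                                        # step 1: what is being solved
    impact: str                                         # step 1: the effect it caused
    occurred_at: datetime                               # step 1: time and date
    evidence: List[str] = field(default_factory=list)   # step 2: gathered data
    root_cause: Optional[str] = None                    # step 3: outcome of the analysis
    actions: List[str] = field(default_factory=list)    # step 4: fixes or work-arounds

    def is_complete(self) -> bool:
        """RCA is only done once a root cause and at least one action exist."""
        return self.root_cause is not None and len(self.actions) > 0

    def summary(self) -> str:
        """Step 5: a plain-text summary that can be filed for future reference."""
        lines = [
            f"Problem: {self.problem}",
            f"Impact: {self.impact}",
            f"Occurred at: {self.occurred_at.isoformat()}",
            f"Root cause: {self.root_cause or 'not yet determined'}",
        ]
        lines += [f"Action: {action}" for action in self.actions]
        return "\n".join(lines)


# Sample values are made up for illustration only.
record = RcaRecord(
    problem="Website is down",
    impact="Customers cannot log in",
    occurred_at=datetime(2024, 1, 15, 9, 30),
)
record.evidence.append("Server logs from the hour before the outage")
record.root_cause = "Log rotation was never configured"
record.actions.append("Configure log rotation on all web servers")
print(record.summary())
```

Keeping `is_complete` separate from `summary` reflects the point made in step 4: a record can be documented at any time, but the RCA is not finished until an action has been recorded.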
Popular RCA methods
The goal of RCA is to recognize all the underlying causes of a problem, and a structured analysis method makes this task easier. Five popular RCA methods are:
- The five whys method
- Fishbone diagram
- Pareto chart
- Scatter diagram
- Kepner-Tregoe method
1. The five whys method
The five whys method is a simple and effective way to identify the root cause of a problem. To use it, ask "Why?" up to five times in a row: if the first answer doesn't reveal the root cause, ask "Why?" again about that answer. Five is a guideline rather than a fixed rule, and repeating this process a few times will usually lead you to the underlying cause.
Here are the steps in more detail:
- Define the problem.
- Ask why the problem happened.
- Write down the cause.
- If your first question did not find the root cause, ask "Why?" again and write down the cause for that.
- Continue this process until you've identified the root cause of the problem.
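The steps above amount to walking a chain of causes until nobody can answer "Why?" any further. Here is a minimal sketch of that loop. The `five_whys` function and the cause chain for the website example are hypothetical; in practice the answers come from the people and evidence involved, not from a lookup table.

```python
def five_whys(problem, answer_why, max_depth=5):
    """Ask "why" repeatedly and return the chain from problem to deepest cause.

    answer_why is any callable that returns the cause of a statement,
    or None when no further answer can be given.
    """
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        cause = answer_why(current)
        if cause is None:
            break  # no additional diagnostic answer: stop drilling down
        chain.append(cause)
        current = cause
    return chain


# Hypothetical answers for the "website is down" example.
known_causes = {
    "Website is down": "Server process crashed",
    "Server process crashed": "Server ran out of memory",
    "Server ran out of memory": "A log buffer grew without bound",
    "A log buffer grew without bound": "Log rotation was never configured",
}

chain = five_whys("Website is down", known_causes.get)
for depth, statement in enumerate(chain):
    print("  " * depth + statement)

root_cause = chain[-1]
```

The `max_depth` parameter keeps the loop from running forever on a circular chain of answers. Raise it if five questions are not enough for your case.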
The five whys method can be used to pinpoint performance-related issues. This approach makes it possible to perform a more thorough drill-down study of the problem and identify the main causes of changes in the performance of the IT infrastructure, technicians, personnel, and other elements.
2. Fishbone diagram
A fishbone diagram, also called an Ishikawa diagram or a cause-and-effect diagram, is a visual way to map cause and effect. The spine of the fish skeleton in the middle of the diagram represents the specific problem, and the rib bones of the skeleton that branch out from the spine represent potential causes. In service management, there are three aspects: people, process, and product. The branches in the diagram are further broken down into smaller, more specific causes based on people, process, and product, which gives service desk technicians a clearer picture. It enables service desk teams to find the underlying cause by surfacing detailed factors that would otherwise go unnoticed.
Steps involved in conducting RCA with a fishbone diagram:
- Identify the problem you are trying to solve. Gather as much data as is available about the problem and its occurrence.
- Once you've identified the problem, brainstorm the potential causes using a fishbone diagram. The fishbone diagram helps visualize and identify the different categories of causes.
- With the potential causes brainstormed, categorize the causes under the factors that could influence the incident, such as people, process, environment, and machine.
- Once the categories are visualized, look for the rib in the fishbone diagram where multiple causes cluster. This is often where the underlying root cause of the incident lies, so investigate it first.
- Finally, come up with the corrective actions to address the root cause and implement them. Once they are implemented, monitor the effectiveness of the solution to catch gaps in the implementation and confirm that the fix holds.
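A fishbone diagram is essentially a problem with causes grouped by category, so it can be sketched as a small data structure before it is drawn. The example below is an illustration only: the function names, categories, and causes are all hypothetical, and the rib with the most causes is treated as a place to start investigating, not as proof of the root cause.

```python
from collections import defaultdict


def build_fishbone(problem, causes):
    """Group (category, cause) pairs into ribs under one problem (the spine)."""
    ribs = defaultdict(list)
    for category, cause in causes:
        ribs[category].append(cause)
    return {"problem": problem, "ribs": dict(ribs)}


def densest_rib(fishbone):
    """Return the category with the most brainstormed causes."""
    ribs = fishbone["ribs"]
    return max(ribs, key=lambda category: len(ribs[category]))


def render(fishbone):
    """Render the diagram as indented text: spine, then ribs, then causes."""
    lines = [f"Problem: {fishbone['problem']}"]
    for category, causes in fishbone["ribs"].items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


# Hypothetical brainstorming output for a slow ticket-resolution problem.
brainstormed = [
    ("People", "New technicians not trained on the escalation path"),
    ("Process", "No defined owner for after-hours tickets"),
    ("Process", "Approval step requires two managers"),
    ("Process", "Tickets are re-categorized manually"),
    ("Product", "Search in the knowledge base returns stale articles"),
]

fishbone = build_fishbone("Tickets take too long to resolve", brainstormed)
print(render(fishbone))
print("Investigate first:", densest_rib(fishbone))
```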
Fishbone diagram uses:
- Improves the quality of service delivery. Identifying the lapses helps improve the overall service delivery quality and improve customer satisfaction.
- Reduces cost overruns in times of turmoil. When economic headwinds put budgets under pressure, the fishbone diagram helps find the root cause of a bloated budget, make the necessary cost cuts, and more.
3. Pareto chart
Pareto charts identify the most significant factors among a large set that could be causing the problem. A Pareto chart is a combined bar and line chart: the factors are plotted as bars arranged in descending order of occurrence count, and a line graph shows the cumulative total across the factors from left to right. The chart applies the 80-20 principle to identify the important factors contributing to a problem. In this context, the 80-20 principle states that roughly 80% of the incidents are caused by 20% of the overall infrastructure. This means that a small number of factors have a disproportionate impact on the number of incidents in the organization.
Steps to perform RCA using the Pareto chart:
- As in the previous method, identify the problem and gather the necessary data points.
- Segregate the data points into various categories.
- Compute the frequency of each category, sort the categories in decreasing order of frequency, and find the cumulative percentage.
- Plot the data on a graph to create the Pareto chart.
- Finally, implement procedures and processes to prevent the recurrence of the problem.
This chart helps determine the problem areas and identify the critical aspects to be resolved first to reduce the recurrence significantly.
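The frequency and cumulative-percentage steps can be worked through in a few lines before anything is plotted. The sketch below uses only the Python standard library. The function names, ticket categories, and counts are made up for illustration, and the 80% threshold is simply the conventional cut-off from the 80-20 principle.

```python
from collections import Counter


def pareto_table(categories):
    """Return (category, count, cumulative %) rows in decreasing order of count."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


def vital_few(rows, threshold=80.0):
    """Return the leading categories needed to reach the cumulative threshold."""
    selected = []
    for category, _count, cumulative in rows:
        selected.append(category)
        if cumulative >= threshold:
            break
    return selected


# Hypothetical ticket log: one category label per ticket.
tickets = (
    ["password reset"] * 50
    + ["VPN"] * 30
    + ["printer"] * 12
    + ["email"] * 8
)

rows = pareto_table(tickets)
for category, count, cumulative in rows:
    print(f"{category:<15} {count:>4} {cumulative:>6}%")

print("Resolve first:", vital_few(rows))
```

The bars of the chart correspond to the `count` column and the line to the cumulative percentage column, so the same rows can be handed to any plotting tool.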
Pareto chart uses:
- Identifies the most common incidents by users, and enables technicians to find permanent fixes and prioritize their resolution efforts.
- Identifies the number of tickets raised for incidents with premade knowledge articles. This enables the service desk to analyze the root cause for such occurrences and more.
4. Scatter diagram
Scatter diagrams, or scatter plots, graph pairs of numerical data, with one variable on each axis, to show whether a relationship exists between them, such as between the priority of tickets and the number of incidents that have occurred. Regression analysis can then be used to quantify that relationship. This is helpful to identify problems that occur due to fluctuating measurements, such as capacity issues that happen when server traffic increases.
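To make the idea concrete, the relationship that a scatter diagram shows can also be measured numerically. The sketch below computes the Pearson correlation coefficient and a least-squares line by hand, using only the standard library. The function names and the traffic and response-time figures are hypothetical, and a strong correlation suggests a relationship worth investigating rather than proving cause.

```python
import math


def _deviation_sums(xs, ys):
    """Return the means and the sums of squared and cross deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Correlation between -1 and 1; values near 0 mean no linear relationship."""
    _mx, _my, sxy, sxx, syy = _deviation_sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares regression line, returned as (slope, intercept)."""
    mean_x, mean_y, sxy, sxx, _syy = _deviation_sums(xs, ys)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements: server traffic versus response time.
requests_per_second = [100, 200, 300, 400, 500]
response_time_ms = [120, 150, 210, 260, 330]

r = pearson_r(requests_per_second, response_time_ms)
slope, intercept = fit_line(requests_per_second, response_time_ms)
print(f"correlation: {r:.3f}")
print(f"response_time_ms is roughly {slope:.2f} * requests_per_second + {intercept:.1f}")
```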
Scatter diagram uses:
This method can be used to organize and keep track of organizational processes. It aids in increasing the quality of the product or service by comparing the output accuracy with the accepted output.
Scatter diagram for quality characteristics
5. Kepner-Tregoe method
The Kepner-Tregoe (KT) method is a problem-solving approach that identifies the underlying cause of an issue. It involves analyzing the various factors that contribute to the problem and eliminating the ones that are irrelevant, thereby isolating the key elements that need to be addressed. The KT method can be used for troubleshooting IT incidents, making IT decisions, managing IT risks, and planning projects by weighing the pros and cons to make informed decisions.
The KT method is a four-step systematic approach to solving complex problems. The steps are:
- Situational analysis: This involves gathering information about the problem, including its definition, impact, and symptoms.
- Problem analysis: This involves identifying the underlying cause of the problem. This is done by using a cause-and-effect matrix to identify potential causes, and then brainstorming within each category.
- Decision analysis: This involves weighing the pros and cons of different solutions to the incident and selecting the best one.
- Potential problem analysis: This involves identifying potential problems with the solution and developing alternate plans to address them.
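One common way to weigh pros and cons in the decision analysis step is a weighted scoring matrix: each criterion gets a weight, each option gets a score per criterion, and the weighted totals are compared. The sketch below assumes that approach. The `rank_options` function, the criteria, the weights, and the scores are all hypothetical values chosen for illustration.

```python
def rank_options(weights, scores):
    """Rank options by weighted total, highest first.

    weights: criterion -> importance weight
    scores:  option -> {criterion: score on that criterion}
    """
    totals = {}
    for option, option_scores in scores.items():
        totals[option] = sum(
            weight * option_scores[criterion]
            for criterion, weight in weights.items()
        )
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical criteria and weights (a higher weight means more important).
weights = {
    "fixes root cause": 5,
    "low risk": 3,
    "fast to implement": 2,
}

# Hypothetical scores from 1 (poor) to 10 (excellent).
scores = {
    "Restart the server": {
        "fixes root cause": 1, "low risk": 8, "fast to implement": 10,
    },
    "Configure log rotation": {
        "fixes root cause": 9, "low risk": 6, "fast to implement": 5,
    },
    "Migrate to a new platform": {
        "fixes root cause": 8, "low risk": 2, "fast to implement": 1,
    },
}

ranking = rank_options(weights, scores)
for option, total in ranking:
    print(f"{total:>4}  {option}")
```

The top-ranked option is then carried into potential problem analysis, where its risks are examined before it is implemented.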
The picture below shows the four steps involved in the KT method for performing RCA. Each step is essential to ensure successful RCA and resolution.