For this section, we will dive into the various techniques employed to find the root cause of a problem in an IT environment.
IT problem management techniques
A good service desk tool can enforce the problem management process, but the techniques used for investigation and diagnosis should stay flexible and fit the organization's needs rather than being overly prescriptive.
Since problems can appear in any shape or size, it's impossible to stick to one technique to find a solution every time; instead, using a combination of techniques will yield the best results. A simple LAN connectivity problem might be solved with a quick brainstorming session, but a network or VoIP issue might need a deeper look.
Here are several techniques you can practice in your organization's problem management process.
Brainstorming
By establishing a dialogue between departments, you gain various perspectives and new information, generating many potential solutions.
To have a productive brainstorming session, you need a moderator. The moderator handles the following:
- Driving the direction of the meeting
- Documenting the insights obtained
- Highlighting the measures to be taken
- Tracking the deliverables discussed
- Keeping the session from running over time
Brainstorming sessions are more productive when collaborative problem-solving techniques, such as Ishikawa analysis and the five whys method, are used. These techniques will be discussed later in this section.
Kepner-Tregoe method
The Kepner-Tregoe (K-T) method is a problem-solving and decision-making technique used in many fields due to its step-by-step approach for logically solving a problem. It's well-suited for solving complex problems in both proactive and reactive problem management.
The method follows four processes:
- Situation appraisal: Assessment and clarification of the scenario
- Problem analysis: Connecting cause with effect
- Decision analysis: Weighing the alternate options
- Potential problem analysis: Anticipating the future
However, problem analysis is the only part that concerns IT problem management, and it consists of five steps.
Define the problem
Identifying what the problem truly is can be a problem in itself. Since problem management is inherently a collaborative effort, having a comprehensive definition of the problem eliminates preconceived notions that any participating member might have, saving a considerable amount of time.
For example, if an organization's automatic data backup on a server has failed, the problem can be defined as:
Failed backup on server
This definition does describe the deviation from the normal situation, but it invites more questions and calls for more information. A good definition should be unambiguous and easily understood.
To remove ambiguity, the above definition can be updated to:
Data backup on November 15 failed on server #34-C
This definition provides more clarity, and spares employees from redundant questions. Nevertheless, this definition can be further improved. Suppose the cause of the data backup failure can be attributed to an event such as the application of a new patch; then the initial problem analysis would undoubtedly lead to this event.
To save time and effort, let's update the definition to:
Data backup on November 15 failed on server #34-C after application of patch 3.124 by engineer Noah
This detailed definition leaves no room for redundant questions, and provides a good amount of information on where the problem could lie. These extra minutes spent on the initial definition save valuable time and effort, provide a logical sense of direction to analysis, and remove any preconceived notions about the problem.
Describe the problem
The next step is to lay out a detailed description of the problem. The K-T method provides the questions that need to be asked about any problem to help identify the possible causes.
The questions below help describe four parts of any problem:
- What is the problem?
- Where did the problem occur?
- When did the problem occur?
- To what extent did the problem occur?
Each of these questions demands two types of answers:
IS: As in, "What is the problem?" or "Where is the problem?"
and
COULD BE but IS NOT: As in, "Where could the problem be but is not?"
This exercise helps compare and highlight what the deviation from normal performance in business processes is, where and when it is happening, and to what extent.
Establish possible causes
The comparison between normal performance and deviated performance made in the previous step helps in shortlisting the possible causes of the problem. Making a table with all the information in one place can be helpful to make the comparison.
 | Is | Could be but is not | Differences | Changes |
---|---|---|---|---|
What | Server #34-C backup failed after patch 3.124 | Failed backups in other servers with patch 3.124 | New engineer (Noah) applied the patch | New patch procedure followed |
Where | 4th floor server | Basement servers | Normally done by Level 3 engineers | Level 1 engineer applied it |
When | November 15, 12:32am | Any other time | None noted | |
Extent | Only on server #34-C | Any other server | None noted | |
New possible causes become evident when the information is assembled together. For our example problem, the root cause can be narrowed down to:
Procedural error caused by the inadequate transfer of knowledge by the Level 3 engineers.
Whatever the problem, a sound analysis for possible causes can be done based on relevant comparison.
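As a rough illustration, the comparison table above can be held as plain data so that the rows with recorded differences or changes stand out. This is only a sketch: the `Dimension` class and `distinctive_dimensions` function are hypothetical names, not part of the K-T method, and "None noted" cells are represented as empty strings.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    """One row of the IS / COULD BE but IS NOT table (hypothetical structure)."""
    name: str
    is_: str
    is_not: str
    differences: str = ""  # empty string stands for "None noted"
    changes: str = ""


def distinctive_dimensions(matrix):
    """Return the rows where a difference or a change was recorded."""
    return [d for d in matrix if d.differences or d.changes]


# Values copied from the table above.
matrix = [
    Dimension("What",
              "Server #34-C backup failed after patch 3.124",
              "Failed backups in other servers with patch 3.124",
              "New engineer (Noah) applied the patch",
              "New patch procedure followed"),
    Dimension("Where",
              "4th floor server",
              "Basement servers",
              "Normally done by Level 3 engineers",
              "Level 1 engineer applied it"),
    Dimension("When", "November 15, 12:32am", "Any other time"),
    Dimension("Extent", "Only on server #34-C", "Any other server"),
]

for d in distinctive_dimensions(matrix):
    print(f"{d.name}: {d.differences} -> {d.changes}")
```

Only the What and Where rows carry differences and changes, which is where the search for possible causes would start.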
Test the most probable cause
The penultimate step is to shortlist the probable causes and test them before proceeding to the conclusion. Each probable cause should be tested against this question:
If _______ is the root cause of this problem, does it explain what the problem IS and what the problem COULD BE but IS NOT?
Again, it's beneficial to populate all the information into a table.
Potential root cause | True if | Probable root cause? |
---|---|---|
Server #34-C has a problem | Only server #34-C has been affected | Maybe |
Incorrect procedure | Same procedure affects another server | Probably |
Engineer error | Problem did not reoccur with same procedure | Probably not |
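The test question can be sketched as a small rating rule. Everything here is an assumption made for illustration: the `Candidate` class, the `rate` function, and especially the true/false judgments, which are placeholders chosen to reproduce the ratings in the table above rather than findings from a real investigation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A potential root cause and how well it fits the description (hypothetical)."""
    cause: str
    explains_is: bool      # accounts for what the problem IS
    explains_is_not: bool  # accounts for what it COULD BE but IS NOT


def rate(candidate):
    """Rate a cause by whether it explains both sides of the comparison."""
    if candidate.explains_is and candidate.explains_is_not:
        return "Probably"
    if candidate.explains_is or candidate.explains_is_not:
        return "Maybe"
    return "Probably not"


# Placeholder judgments, assumed for the sake of the example.
candidates = [
    Candidate("Server #34-C has a problem", explains_is=False, explains_is_not=True),
    Candidate("Incorrect procedure", explains_is=True, explains_is_not=True),
    Candidate("Engineer error", explains_is=False, explains_is_not=False),
]

for c in candidates:
    print(f"{c.cause}: {rate(c)}")
```

In practice the two judgments come from the team's evidence, so the rule is only as good as the answers fed into it.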
Verify the true cause
The final step is to eliminate all the improbable causes and gather evidence for the most probable cause. Once the cause is verified, it's time to propose a solution to the problem. Without evidence for the suspected root cause, the solution should not be attempted.
Ishikawa analysis, or fishbone diagram analysis
Ishikawa analysis uses the fishbone framework to enumerate the causes and effects of a problem, and can be used in conjunction with brainstorming sessions and the five whys method. Although root cause analysis (RCA) with an Ishikawa diagram is simple to carry out, it is capable of handling complex problems.
To start the analysis, define the problem and use it as the head of the fishbone. Draw the spine and add the categories that the problem could be originating from as ribs to the fishbone.
Generally, it's easiest to start the categories with the four dimensions of service management: partners, processes, people, and technology. However, these categories can be anything relevant to your problem, environment, organization, or industry.
Once these categories form the ribs of the fishbone, start attaching possible causes to each category. Each possible cause can also branch out to detail the reason for that occurrence. This could lead to a complex diagram of four to five levels of causes and effects, subsequently drilling down to the root cause of the problem.
It's recommended to split up dense ribs into additional ribs as required. Alternatively, merging empty ribs with other suitable ribs keeps the fishbone clean and easy to read. Additionally, you should ensure the ribs are populated with causes, not just symptoms of the problem.
This analysis is again a collaborative effort, and requires a moderator to direct the brainstorming sessions in an effective way. Every participant has the opportunity to engage, providing a comprehensive view of the problem.
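A fishbone diagram can also be kept as nested data while the session is in progress. The sketch below assumes a plain dictionary layout and two hypothetical helper functions, `cause_chains` and `empty_ribs`; the causes are taken from the backup failure example used elsewhere in this section, and the Partners rib is left empty on purpose.

```python
# Head of the fishbone plus one rib per category; each cause maps to its sub-causes.
fishbone = {
    "problem": "Data backup failed on server #34-C",
    "ribs": {
        "People": {
            "Level 1 engineer applied the patch": {
                "Level 3 engineers were busy with a major incident": {},
            },
        },
        "Processes": {
            "New patch procedure followed": {
                "No standardized knowledge transfer format": {},
            },
        },
        "Technology": {"Patch 3.124 applied": {}},
        "Partners": {},
    },
}


def leaf_paths(causes, path=()):
    """Yield each chain of causes from the top level down to its deepest cause."""
    if not causes:
        yield path
        return
    for cause, sub_causes in causes.items():
        yield from leaf_paths(sub_causes, path + (cause,))


def cause_chains(diagram):
    """Return (rib, cause, sub-cause, ...) tuples for every populated rib."""
    chains = []
    for rib, causes in diagram["ribs"].items():
        if not causes:
            continue  # empty ribs have no chains to report
        for chain in leaf_paths(causes):
            chains.append((rib,) + chain)
    return chains


def empty_ribs(diagram):
    """Return ribs with no causes, which are candidates for merging."""
    return [rib for rib, causes in diagram["ribs"].items() if not causes]


for chain in cause_chains(fishbone):
    print(" <- ".join(chain))
print("Empty ribs:", empty_ribs(fishbone))
```

Listing the empty ribs gives the moderator a quick prompt to merge them, and the deepest entry in each chain is the candidate to examine as a root cause.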
Pareto analysis
The Pareto principle is an observation that approximately 80 percent of effects come from approximately 20 percent of causes. This observation applies to a wide range of subjects, including problem management.
When trying to reduce the number of incidents occurring in an organization, it's highly efficient to apply Pareto analysis before jumping into solving the problems. Pareto analysis prioritizes the causes of incidents, and helps in managing problems based on their impact and probability.
This analysis is carried out by generating a Pareto chart from a Pareto table. A Pareto table lists the count and the cumulative count of problems in each classification. A Pareto chart is a bar graph of those counts, sorted in decreasing order, with the cumulative percentage plotted alongside.
To create a Pareto chart, follow the steps given below:
- Collect problem ticket data from your service desk tool.
- Remodel the data into categories based on various attributes.
- Create a Pareto table listing each classification and attribute.
- Compute the frequency of problem occurrences in each category over a period of time.
- Sort the categories in decreasing order of frequency and generate the cumulative frequency percentage.
- Plot the data on a graph to create a Pareto chart.
The most important step is to remodel the data into a countable set of classifications and attributes.
Classification | Attribute | ||
---|---|---|---|
Impact | Affects business | Affects department | Affects user |
Priority | Low | High | Urgent |
Category | Network | Hardware assets | Software assets |
Duration | In SLA | Outside SLA | No SLA |
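A Pareto table built from these classifications, sorted by count in decreasing order, might look like this:
Classification | Attribute | Count | Cumulative count | Cumulative % |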
---|---|---|---|---|
Priority | Low | 800 | 800 | 21.07% |
Duration | No SLA | 670 | 1,470 | 38.72% |
Priority | High | 550 | 2,020 | 53.21% |
Duration | Outside SLA | 500 | 2,520 | 66.39% |
Category | Network | 430 | 2,950 | 77.71% |
Priority | Urgent | 300 | 3,250 | 85.62% |
Category | Software assets | 270 | 3,520 | 92.73% |
Category | Hardware assets | 150 | 3,670 | 96.68% |
Impact | Affects department | 80 | 3,750 | 98.79% |
Impact | Affects user | 35 | 3,785 | 99.71% |
Impact | Affects business | 9 | 3,794 | 99.95% |
Duration | In SLA | 2 | 3,796 | 100% |
This chart helps identify the problems that should be solved first to significantly reduce service disruption. This analysis complements the Ishikawa and Kepner-Tregoe methods by providing a way to prioritize the category of problems, while the other methods analyze the root cause.
It's important to remember that the 80/20 rule suggests likely causes, and may be incorrect at times.
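The arithmetic behind a Pareto table is short enough to sketch. The example below uses only the three Category counts from the table above (Network 430, Software assets 270, Hardware assets 150); the function names `pareto_table` and `vital_few` and the 80 percent threshold parameter are assumptions made for illustration.

```python
def pareto_table(counts):
    """Sort categories by count (descending) and add cumulative count and percentage."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for label, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        cumulative += count
        rows.append((label, count, cumulative, round(100 * cumulative / total, 2)))
    return rows


def vital_few(rows, threshold=80.0):
    """Return the leading categories needed to reach the threshold percentage."""
    selected = []
    for label, _count, _cumulative, percent in rows:
        selected.append(label)
        if percent >= threshold:
            break
    return selected


# Category counts taken from the Pareto table above.
category_counts = {
    "Hardware assets": 150,
    "Network": 430,
    "Software assets": 270,
}

rows = pareto_table(category_counts)
for label, count, cumulative, percent in rows:
    print(f"{label:<16} {count:>4} {cumulative:>5} {percent:>7.2f}%")
print("Address first:", vital_few(rows))
```

Restricting the table to a single classification, as done here, keeps the percentages comparable; plotting the counts as bars and the cumulative percentage as a line would turn these rows into a Pareto chart.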
Five whys technique
Five whys is a straightforward technique for RCA. It defines a problem statement, then repeatedly asks why until the underlying root cause of the problem is discovered. The number of whys doesn't need to be limited to five, but can be based on the problem and the situation.
The five whys technique complements many other problem-solving techniques like the Ishikawa method, Pareto analysis, and the K-T method.
Using the previous example of the data backup failure in a server, let's apply the five whys technique.
Why did the data backup fail in server #34-C? | Due to the application of patch 3.124. |
Why was it due to patch 3.124? | The procedure used was different. |
Why was the procedure different? | A Level 1 engineer was responsible for it. |
Why was the Level 1 engineer responsible? | The Level 3 engineers were busy with a major incident and had improper transfer of knowledge. |
Why was there an improper transfer of knowledge? | There isn't a standardized schedule or format used in the organization. |
The above iterative process reveals the absence of a standardized format, which has led to the problem of data backup failure.
For our purposes, the example above is a simple execution of the method. In a real scenario, the next question depends on the answer to the previous question, so it's imperative to collaborate with stakeholders who have elaborate knowledge of the domain the problem resides in.
By adopting parts of the K-T method along with the five whys technique, such as providing evidence for each answer before following it up with the next question, you can ensure precise analysis during problem-solving sessions.
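The chain of questions, combined with the idea of backing each answer with evidence, might be sketched as follows. The `Why` class, the `verified` flag, and the `root_cause` function are hypothetical; the questions and answers come from the example above, and marking every step as verified is an assumption made only so the example runs to the end.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One question-and-answer step in a five whys chain (hypothetical structure)."""
    question: str
    answer: str
    verified: bool = False  # True once evidence backs the answer


def root_cause(chain):
    """Return the deepest answer reached before the first unverified step."""
    confirmed = None
    for step in chain:
        if not step.verified:
            break  # do not ask the next why on top of an unproven answer
        confirmed = step.answer
    return confirmed


# All steps are assumed verified here purely for illustration.
chain = [
    Why("Why did the data backup fail in server #34-C?",
        "Due to the application of patch 3.124.", verified=True),
    Why("Why was it due to patch 3.124?",
        "The procedure used was different.", verified=True),
    Why("Why was the procedure different?",
        "A Level 1 engineer was responsible for it.", verified=True),
    Why("Why was the Level 1 engineer responsible?",
        "The Level 3 engineers were busy with a major incident and had improper transfer of knowledge.",
        verified=True),
    Why("Why was there an improper transfer of knowledge?",
        "There isn't a standardized schedule or format used in the organization.",
        verified=True),
]

print(root_cause(chain))
```

If any step were left unverified, `root_cause` would stop at the last answer that has evidence behind it, signalling that more investigation is needed before going deeper.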
Other techniques
Apart from the five major techniques, there are still numerous others, each with their own unique strengths. Overall, problem investigation is carried out using a combination of techniques suitable for the situation. Some other techniques that are prevalent in the problem management community are chronological testing, fault tree analysis, the fault isolation method, hypothesis testing, and pain value analysis. It's worth taking the time to learn many techniques as your organization's problem management process matures.
Up next:
You have made it this far! In the penultimate part of this six-part series, you will learn about the best practices of problem management that can help you clear any hurdles during your problem management journey.