RCA in IT: 5 mistakes that can cripple your root cause analysis
Jul 15 · 07 min read
A major incident strikes—we're all in it. The sysadmin, network operations, security, and privacy teams put their heads together to work on the fix. We might even work extra hours to ensure the service is up and running. And then...we forget all about it. Sound familiar?
At ManageEngine, we never waste a crisis. We trace each problem to its origins by performing root cause analysis (RCA) so we can fix the underlying issues. We believe our incident management process is only as good as this analysis, because it helps us take corrective action wherever necessary and systematically prevent future issues.
We've conducted plenty of RCA meetings and made a few mistakes along the way, and we're here to share what we learned so you can avoid those blunders.
1. Labeling human error as the root cause
There is often the temptation to conclude that human error was the root cause because it seems to fit most cases. Companies are made of employees, and any mistake that leads to an incident will involve a human committing an error. However, that does not imply the person is the root cause.
We believe there's a policy or a procedural flaw that triggers human error. If you fix the underlying policy, employees will be less likely to make similar errors.
Let's take the example of an incident that occurred on our premises many years back. In the CyberSec section of our incident management handbook, we briefly discussed the role of our Red Team, which performs white-hat hacking to spot security loopholes. It once ran a scan to find out the passwords of our employees' office computers; this was obviously to ascertain who had weak passwords. The Red Team noticed some employees had not changed their computer’s default password, the one allotted when they first got the device.
As a security measure, we always recommend employees change their default passwords. This is to encourage them to use stronger, more personal passwords. When an employee doesn't change their password, it becomes a security risk. Anyone with a little knowledge of how we generate default passwords could hack the employee’s computer if the default password isn’t changed immediately—and this is exactly what our Red Team did. So, we recorded this as an incident and came up with the root cause.
Now, we make our recommendation to change the default password loud and clear as soon as an employee receives their computer. So we could easily have concluded that the employees who didn't change their default passwords were the root cause of this incident.
However, that's far from the truth. We concluded that our password policy was the main root cause, so we strengthened it in Active Directory (AD). The idea is never to put the blame on a person, but to improve our policies and procedures.
This mistake must also be avoided during other types of incidents.
2. Concluding there’s just one root cause
In the previous scenario, it’s easy to just call it done after amending the password policy. However, we dug further and saw what else could have been done to prevent the incident. The policy was of course a crucial factor, but there were other reasons why our Red Team was able to crack the default passwords.
So, we found two more supporting causes:
- Default passwords with the same patterns were repeatedly generated. This is why the Red Team's scan was able to spot the pattern easily. So, we needed to create unique default passwords from AD.
- The password reset procedure was not being triggered frequently, so employees never thought to strengthen their passwords. We made it a policy to trigger the password reset process periodically.
Even though the main root cause was the password policy, getting either of these two supporting causes right could still have protected many passwords. So, we never stop at one root cause. We always dig deeper to discover more.
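To make the first supporting cause concrete, here is a minimal Python sketch of generating default passwords that do not follow a shared pattern. It is an illustration only, not the mechanism used in AD: the function name `generate_default_password`, the symbol set, and the default length of 16 are all assumptions. It relies on the standard library's `secrets` module, which draws from a cryptographically secure random source.

```python
import secrets
import string

# Assumed symbol set; adjust to whatever your directory's policy allows.
SYMBOLS = "!@#$%^&*"
ALPHABET = string.ascii_letters + string.digits + SYMBOLS


def generate_default_password(length: int = 16) -> str:
    """Return a random password containing all four character classes."""
    if length < 4:
        raise ValueError("length must be at least 4")
    while True:
        # Each character is drawn independently, so no two devices
        # share a predictable prefix, suffix, or template.
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if (
            any(c.islower() for c in candidate)
            and any(c.isupper() for c in candidate)
            and any(c.isdigit() for c in candidate)
            and any(c in SYMBOLS for c in candidate)
        ):
            return candidate


if __name__ == "__main__":
    print(generate_default_password())
```

Because every character is chosen at random, knowing one employee's default password reveals nothing about another's, which is the property the pattern-based defaults lacked.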
3. Gathering insufficient evidence
It is quite easy to make this mistake since we tend to be satisfied once we find multiple root causes and decide the next course of action. However, without solid evidence, these root causes could simply be your opinions. Even though they may still be true, you need sufficient evidence to be sure. Gathering this evidence also makes it easy for others to refer back to the incident in the future.
In the above scenario, our evidence was:
- The record of how frequently passwords were changed in AD.
- The list of default passwords generated (used to identify patterns).
- An audit of how well the password policy was being followed in the company.
Gathering this evidence also meant we had sufficient data and resources to perform deeper impact analysis.
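The first two kinds of evidence lend themselves to a small script. The sketch below is a hypothetical example in Python, not the tooling used for this incident: the `AccountRecord` fields, the 90-day threshold, the five-minute grace period, and the sample passwords are all assumptions, and the records are presumed to have been exported from the directory beforehand. It flags accounts whose password is old or was never changed after creation, and it reduces each default password to a character-class "shape" so that repeated patterns stand out.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AccountRecord:
    # Hypothetical export format; real directory attributes will differ.
    username: str
    created: datetime
    password_last_set: datetime


def stale_accounts(records, now, max_age=timedelta(days=90)):
    """Usernames whose password is older than max_age (assumed 90 days)."""
    return [r.username for r in records if now - r.password_last_set > max_age]


def never_changed(records, grace=timedelta(minutes=5)):
    """Usernames whose password was last set at, or just after, creation."""
    return [r.username for r in records if r.password_last_set - r.created <= grace]


def shape(password):
    """Map each character to its class: U(pper), L(ower), D(igit), S(ymbol)."""
    classes = []
    for c in password:
        if c.isupper():
            classes.append("U")
        elif c.islower():
            classes.append("L")
        elif c.isdigit():
            classes.append("D")
        else:
            classes.append("S")
    return "".join(classes)


def shape_counts(passwords):
    """Count how many passwords share each shape; a dominant shape is a pattern."""
    return Counter(shape(p) for p in passwords)


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    records = [
        AccountRecord("alice", datetime(2023, 1, 1), datetime(2023, 1, 1)),
        AccountRecord("bob", datetime(2023, 1, 1), datetime(2023, 12, 15)),
    ]
    print(stale_accounts(records, now))
    print(never_changed(records))
    # Made-up sample defaults, purely to show the shape grouping.
    print(shape_counts(["Me@2021", "Me@2022", "xK9#pQ"]).most_common(1))
```

A single shape that accounts for most of the generated defaults is the kind of evidence that turns "the passwords looked similar" from an opinion into a documented finding.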
4. Not involving the right people
The next step is to understand the impact fully and establish preventive measures.
Though the Incident Management Team is the pivot of our process, it can't be the sole contributor to RCA. Subject-matter experts can greatly increase how much we learn from the incident and how well we apply those insights.
In the above example, we involved the head of IT, head of security, data protection experts, and a few others. We got even more insights into the incident’s impact:
- With the default credentials being used on some computers, there was increased risk of our VPN being compromised.
- If the VPN were compromised, an attacker could go on to extract the passwords for other applications.
- If the user hadn't activated two-factor authentication for some of these applications, an attacker could gain access to even more of ManageEngine's sensitive information.
Understanding these possible repercussions was crucial for taking further steps after the incident, and we would have missed these if we had involved just the Incident Management Team. However, the Incident Management Team is still responsible for involving the right people to get more insights from the incident.
5. Not conducting a solid RCA meeting
Finally, a weak RCA meeting could undermine how you implement measures to prevent similar incidents. A solid RCA meeting involves the right people and has an agenda that sets out how deep you want the analysis to go.
In this example, our RCA meetings involved in-depth analysis of how to ensure a weak password will never give an attacker the opportunity to come close to our company’s sensitive information. We had this as the agenda and prepared a series of questions surrounding the incident for the experts to answer.
We ended up with multiple measures, like:
- Including two-factor authentication for our VPN and other sensitive applications.
- Disabling regular access to some crucial areas in our infrastructure.
We arrived at these measures because we arranged a solid RCA meeting.
What could have been a simple blame game about employees not being aware of strong passwords turned into a thorough analysis of our own policies, procedures, and IT infrastructure. Looking back, we have prevented that incident, and many related incidents, from happening again at ManageEngine. This was possible because we avoided these mistakes while performing RCA.
If you want more insights into how we carry out RCA and handle our overall incident management process, check out our incident management handbook.