IT major incident management process: Real-life examples
What's inside the video
- 4 real-life scenarios
- How to turn every service request into an experience to boost your ITSM maturity
- Case study: New hire requests to fill 8,000 open positions
- How employee onboarding should work to function smoothly
- KPIs that matter
- Take your major incident management (MIM) process up a notch
- Case study: Major availability incident hits a web performance company
- Their incident team debugging the situation
- Major availability incident management framework
- Major availability incident management framework with ServiceDesk Plus
- A structured approach to effectively roll out a major change
- Case study: How do you roll out a change effectively and help your company embrace change?
- The SMB embraces change with ServiceDesk Plus
- Is the change process effective?
- Use case: Build a rock solid ITAM strategy and grow your organization's ITAM maturity
- An educational institution has to upgrade from Windows 8 to the latest version
- Major challenges
- How to easily solve these challenges
- Track metrics that matter
Download your free copy of the presentation
Video transcription
Take your major incident management (MIM) process up a notch
Now, let's move to our second scenario, which is handling a major incident and getting your services back online. As organizations, we don't really like major incidents, right? We try to steer clear of them, but it's always better to anticipate the occurrence of these events in advance and have a strategy to deal with them because, if not, it just creates chaos and confusion. So, in this real-life scenario, we look at an organization that did not have an incident management strategy, and let's see how well it responded to a major incident.
Case study: Major availability incident hits a web performance company
So this is the case study that we have. We have a web performance and security company that offers CDN, DNS, and DDoS protection to many websites. As a standard operating process, this company's firewall team regularly deployed new rules in their web application firewall (WAF) to respond to new security vulnerabilities on the internet. During one such routine update, a minor change made by one of their engineers spiked CPU usage across their servers, bringing down half of the websites around the world. What customers ended up seeing was a 502 Bad Gateway error. So, as you can realize, this is a major incident of the highest magnitude.
Their incident team debugging the situation
So let's break down the sequence of events in a timeline and see how the organization actually worked on it. At 13:42, the outage happens, and the services fail. As soon as that happens, they receive alerts from different monitoring tools, and different alerts are created, such as service-down alerts, error alerts, and so on. Eight minutes into the incident, the SRE team realizes that something has gone wrong, and by that time, 80% of the traffic is already down.
They suspect an external attack and, realizing the impact of the incident, finally declare a major incident. Their London engineering team is alerted about the global outage, and throughout this entire period, their support team is flooded with calls. Phones are ringing off the hook, and tickets are piling up. Thirty-three minutes into the incident, an incident response team (IRT) is formed with members drawn from multiple teams. Yes, let me state it again: at the peak of chaos and confusion, 33 minutes after the major incident began, an incident response team is being constituted. That's a major bottleneck right there.
This IRT is under intense pressure from management, and it has still not identified the root cause. Nearly an hour later, they dismiss the possibility of an external attack and finally figure out that the issue is with the WAF. A global WAF kill is implemented, and the websites are finally brought back online. So, as you can see, throughout this timeline, there were major roadblocks: recognizing the incident, putting together a team, communicating with stakeholders, and triaging. So, how do we overcome all these bottlenecks so that your business is not affected?
Major availability incident management framework
Here is a best-practice workflow that we use at Zoho to combat major incidents. It starts off with detecting an alert from the monitoring tool and converting it into a ticket in your service desk tool. As soon as that happens, you recognize that there is a major incident, and then you communicate with your stakeholders, such as your CIOs, CTOs, or IRT managers, and bring them together to kickstart the process of triaging. You then assess the impact of the incident and choose whether or not to declare a major incident. By now, your end users would be panicking because they are unable to access critical business services.
So you communicate externally to them, put out an announcement saying that there is an incident and that you're working on it. So by now, you create different tasks, delegate them to appropriate resolver groups who then provide the workaround and ensure that your services are taken back online. And that ends the boundary of incident management.
Now we need to perform a root cause analysis and ensure that this major incident does not recur. For that, we need to create a problem ticket.
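To make the framework easier to follow, here is a minimal, illustrative sketch in Python (not a ServiceDesk Plus feature) that lists the stages above as an ordered checklist and reports the next step that still needs attention. The stage wording is paraphrased from the workflow described here.

```python
# Illustrative only: the major incident management stages from the workflow
# above, modeled as an ordered checklist so each step is handled in sequence.
MIM_STAGES = [
    "Detect a monitoring alert and convert it into a ticket",
    "Recognize the event as a potential major incident",
    "Notify internal stakeholders (CIO, CTO, IRT managers)",
    "Assess the impact and declare a major incident",
    "Announce the outage to affected end users",
    "Create and delegate tasks to resolver groups",
    "Provide a workaround and restore services",
    "Create a problem ticket and start root cause analysis",
]

def next_stage(completed):
    """Return the first stage that still needs attention, or None if done."""
    for stage in MIM_STAGES:
        if stage not in completed:
            return stage
    return None

if __name__ == "__main__":
    done = set(MIM_STAGES[:3])  # pretend the first three steps are finished
    print("Next step:", next_stage(done))
```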
Major availability incident management framework with ServiceDesk Plus
So this is how you deal with a major incident effectively. Now, let's see how you can leverage ServiceDesk Plus to do the same.
What you're seeing on your screen right now is OpManager, which is the network monitoring software from ManageEngine. You can integrate OpManager with ServiceDesk Plus so that whenever a monitoring alert is created, it is automatically converted into a ticket in ServiceDesk Plus. What you're seeing right now is exactly that implementation. As soon as the monitoring alert is created, this is how the ticket is reflected in ServiceDesk Plus. As you can see, a brief description of the incident is provided, and pretty much everything you see here is what we saw earlier in the service request scenario.
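The actual integration is configured inside OpManager and ServiceDesk Plus themselves. Purely as an illustration of the underlying alert-to-ticket pattern, here is a small Python sketch that posts an alert payload to a hypothetical ticketing REST endpoint; the URL, field names, and API key are assumptions for the example, not the product APIs.

```python
# A minimal sketch of the alert-to-ticket pattern. The endpoint URL, payload
# fields, and API key are illustrative assumptions, not the actual
# OpManager or ServiceDesk Plus API.
import requests

SERVICE_DESK_URL = "https://servicedesk.example.com/api/tickets"  # hypothetical
API_KEY = "replace-with-a-real-key"                               # hypothetical

def alert_to_ticket(alert):
    """Convert a monitoring alert dict into a service desk ticket."""
    ticket = {
        "subject": f"[ALERT] {alert['source']}: {alert['message']}",
        "description": alert.get("details", ""),
        "category": "Availability",
        "priority": "High" if alert.get("severity") == "critical" else "Medium",
    }
    response = requests.post(
        SERVICE_DESK_URL,
        json=ticket,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    sample_alert = {
        "source": "web-server-01",
        "message": "502 Bad Gateway on all monitored URLs",
        "details": "CPU usage spiked across the server farm.",
        "severity": "critical",
    }
    print(alert_to_ticket(sample_alert))
```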
The next step is to communicate with stakeholders and inform them of this major incident. For that, we'll make use of automation again, but this time it is business rules. Business rules are condition-based actions, which ensure that there is no time delay in communicating major incidents. As soon as a ticket is logged with a matching subject, such as "website down," this set of actions is performed: setting the priority to major incident and placing the ticket in the appropriate support group so that they can kickstart the process of troubleshooting. You could also send notifications to specific stakeholders, and those notifications could be in the form of an email or an SMS. So, as you can see, that's how simple it is to communicate with stakeholders in real time. This eliminates a major bottleneck.
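To show the shape of such a condition-based rule, here is a simplified Python sketch: when an incoming ticket's subject matches a keyword, it sets the priority, routes the ticket to a support group, and notifies stakeholders. The keywords, group name, and email addresses are made up for the example.

```python
# A simplified model of a condition-based business rule: match keywords in the
# ticket subject, then set priority, assign a group, and notify stakeholders.
RULES = [
    {
        "keywords": ("website down", "service unavailable"),
        "actions": {
            "priority": "Major incident",
            "group": "Major Incident Response Team",
            "notify": ["cio@example.com", "irt-manager@example.com"],
        },
    },
]

def apply_business_rules(ticket):
    subject = ticket["subject"].lower()
    for rule in RULES:
        if any(keyword in subject for keyword in rule["keywords"]):
            ticket.update(
                priority=rule["actions"]["priority"],
                group=rule["actions"]["group"],
            )
            for recipient in rule["actions"]["notify"]:
                print(f"Notifying {recipient} about: {ticket['subject']}")
            break
    return ticket

if __name__ == "__main__":
    print(apply_business_rules({"subject": "Website down - 502 errors worldwide"}))
```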
So let me go back to the best-practice workflow again and show you where we are. As you can see, we detected the major incident, and we communicated immediately with our stakeholders. The next step is to assess the damage and declare a major incident. We saw how multiple monitoring alerts were created, which translated into multiple tickets. You could link all these tickets together and ensure that you troubleshoot the major incident as one.
As you can see, on the right side over here, all the affected assets are also associated with this incident ticket. Let me click on these assets, and as soon as I do that, all the asset details are displayed: the detailed CI information, the hardware and software information, and the relationships obtained from the CMDB. This helps you ascertain whether major services will be affected or not. As you can see, admin services at an entire geographical location, which is Delhi here, would be affected because this server hosts these two services.
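Conceptually, this impact assessment is a walk over the CI relationship graph. Here is a minimal Python sketch of that idea, traversing outward from the failed server to find every service and user group that depends on it; the CI names and relationships are invented for the example, not pulled from the CMDB shown in the demo.

```python
# An illustrative CI relationship graph: traverse outward from the failed
# server to find every dependent business service and user group.
from collections import deque

# CI -> CIs that depend on it (made-up names for the example)
CI_RELATIONSHIPS = {
    "server-delhi-01": ["admin-portal", "payroll-service"],
    "admin-portal": ["delhi-office-users"],
    "payroll-service": ["delhi-office-users"],
}

def affected_cis(failed_ci):
    """Breadth-first traversal of the dependency graph from the failed CI."""
    affected, queue = set(), deque([failed_ci])
    while queue:
        ci = queue.popleft()
        for dependent in CI_RELATIONSHIPS.get(ci, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

if __name__ == "__main__":
    print(affected_cis("server-delhi-01"))
```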
So what we do next is go ahead and communicate this incident to the affected end users. For that, let me go to the technician's home page and click on Add New Announcement. You can create a new announcement over here, ensure that it is displayed within a specific time frame, and choose to show it to just the right group of affected users. In this case, we saw that users in a particular department at a particular location would be affected. So you can ensure that the rest of the end users carry on with their duties while the affected end users are kept informed.
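As a rough sketch of that targeting logic, assuming a simple list of user records with site and department fields (not the actual announcement feature), the audience and display window could be modeled like this in Python:

```python
# Illustrative targeted announcement: show the outage notice only to users at
# the affected site and department, and only while the display window is open.
from datetime import datetime

USERS = [
    {"name": "Asha", "site": "Delhi", "department": "Admin"},
    {"name": "Ravi", "site": "Delhi", "department": "Finance"},
    {"name": "Meera", "site": "Chennai", "department": "Admin"},
]

def announcement_audience(users, site, department):
    """Return only the users who should see the announcement."""
    return [u for u in users if u["site"] == site and u["department"] == department]

def is_visible(start, end, now=None):
    """True while the announcement's display window is open."""
    now = now or datetime.now()
    return start <= now <= end

if __name__ == "__main__":
    window = (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 18, 0))
    print(announcement_audience(USERS, "Delhi", "Admin"))
    print(is_visible(*window, now=datetime(2024, 1, 1, 14, 0)))
```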
So, let us again go back to our best-practice workflow and see how far we have come. As you can see, we detected the occurrence of the major incident, we communicated it, we assessed the impact, and we announced it to end users. All that is left is to delegate the different tasks and kickstart the process of troubleshooting, which means providing a workaround. For that, we will go back to the incident ticket and click on Tasks. We discussed tasks at length in the earlier service request scenario.
Just as before, you could create different tasks, assign them to different groups, and ensure that you configure the right dependencies. Dependencies matter a lot here because triaging is really necessary, and in a major incident, you need to follow a defined procedure. Once the tasks have been completed and a workaround provided, you need to add it to your resolution. You could go ahead and add your resolution over here. If this incident had not occurred before, you could also add the resolution to your knowledge base. This will help you combat future occurrences of the same incident.
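Task dependencies like these are essentially an ordering problem: a task can start only after the tasks it depends on are finished. Here is a small Python sketch of that idea using the standard library's topological sorter; the task names echo the WAF case study but are invented for the example.

```python
# Illustrative triage tasks with dependencies: each task maps to the set of
# tasks that must finish before it can start.
from graphlib import TopologicalSorter  # Python 3.9+

TASKS = {
    "Rule out an external attack": set(),
    "Confirm the WAF as the failure point": {"Rule out an external attack"},
    "Disable the faulty WAF rule": {"Confirm the WAF as the failure point"},
    "Verify services are back online": {"Disable the faulty WAF rule"},
}

if __name__ == "__main__":
    # static_order() yields tasks so that every dependency comes first.
    for step, task in enumerate(TopologicalSorter(TASKS).static_order(), start=1):
        print(f"{step}. {task}")
```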
Now, we have completed the boundary, or the domain, of incident management. All that is left is to create a problem ticket and initiate a root cause analysis into what caused this incident in the first place. For this, you could go to Associations over here and click on Create a New Problem. All the details are carried over, so you could create a new problem, and your technician or an appropriate technician group would perform a root cause analysis. That's how easy it is. Now it looks very easy, right? We have overcome all the bottlenecks the web performance company faced. We understood the best practice and applied it to our ITSM approach with the right capabilities.
Now, as with the previous service request, we need to keep track of some essential metrics. This is very important because only then will you know what gaps are present in your strategy and how well you can bridge them.
So, here you have your ticket volume, your technician productivity, your resolution time, and again, your ticket churn. I would advise you to create dedicated dashboards for dealing with your major incidents. As you can see, I've created a major incident management dashboard over here, which shows, in real time, the major incidents by category, by technician, and the number of incidents closed by technician. So, that brings us to the end of our second scenario. By now, you should be confident about handling any future major incident.
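For a sense of what sits behind such a dashboard, here is a minimal Python sketch that rolls up a few of these numbers, average resolution time and incident counts by category and technician, from a handful of made-up ticket records; it is an illustration of the metrics, not the reporting engine itself.

```python
# Illustrative metrics rollup over closed major incident tickets: average
# resolution time plus counts by category and by technician.
from collections import Counter
from datetime import datetime

TICKETS = [
    {"category": "Availability", "technician": "Priya",
     "opened": datetime(2024, 1, 1, 13, 42), "resolved": datetime(2024, 1, 1, 15, 10)},
    {"category": "Network", "technician": "Arjun",
     "opened": datetime(2024, 1, 2, 9, 0), "resolved": datetime(2024, 1, 2, 9, 45)},
]

def resolution_minutes(ticket):
    return (ticket["resolved"] - ticket["opened"]).total_seconds() / 60

if __name__ == "__main__":
    average = sum(resolution_minutes(t) for t in TICKETS) / len(TICKETS)
    print(f"Average resolution time: {average:.1f} minutes")
    print("Incidents by category:", Counter(t["category"] for t in TICKETS))
    print("Incidents closed by technician:", Counter(t["technician"] for t in TICKETS))
```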