Last updated on: October 24, 2023
In today's fast-paced business landscape, major incidents can strike unexpectedly, impacting productivity and customer satisfaction. Regardless of the industry, recovery plans and a culture of resilience are essential to business continuity in times of crisis. Consider airline operations: they are complex, and efficiency depends on the concerted effort of many individuals. IT systems have become essential to keeping flights on schedule, so an IT meltdown disrupts many users at once. For example, Zylker Airlines recently experienced an IT outage that inconvenienced thousands of passengers.
Let's have a look at the cause and events of Zylker's outage.
Cause of the outage
Zylker's airline operations were hampered by an IT outage. The airline's website displayed a 502: Bad Gateway error, causing problems with reservations, check-in, cancellations, and other operations. The root cause of the IT outage was attributed to a vendor-supplied firewall failure.
To respond to new vulnerabilities and threats, the vendor's firewall team regularly adds new rules to the web application firewall's (WAF) existing managed rules and then pushes the update globally. During one such update, one of the firewall team's engineers made a minor modification that inadvertently pushed the utilization of the CPUs dedicated to serving HTTP and HTTPS traffic throughout the network to nearly 100%. This left the airline's website and key systems inaccessible for a few hours.
Events leading to the outage
During the maintenance window, the updated WAF rules were applied, and the system appeared to function normally. Several hours after the rule update, users began to report issues in accessing the airline's website. The IT team started investigating the cause, and after 13 minutes, the issue was declared a major incident, and stakeholders were notified. Meanwhile, there was a high volume of calls reporting the issue.
Members from multiple teams were gathered to form an incident response team 33 minutes after the incident occurred. The team investigated cyberattack hypotheses. It took them 53 minutes to rule out the possibility of a cyberattack, and the root cause was still not identified.
Upon further investigation, the team attributed the IT outage to the WAF managed rule update and began rolling the update back to restore the service. Normal operations resumed 78 minutes after the disruption of the airline's web application began.
Despite the use of multiple applications and monitoring tools, Zylker had several missteps and couldn't handle the major disruption efficiently. This incident highlighted the need for a robust major incident management process and effective communication protocols to mitigate the impact of future outages.
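The timeline above can be broken into phases with a few lines of arithmetic. This is a minimal sketch that assumes every figure (13, 33, 53, and 78 minutes) is measured from the same starting point, the onset of the disruption; the account does not state that explicitly for each milestone, so treat the per-phase numbers as illustrative.

```python
# Minutes elapsed since the disruption began (assumed common reference point).
MILESTONES = [
    ("Major incident declared", 13),
    ("Incident response team formed", 33),
    ("Cyberattack ruled out", 53),
    ("Service restored after rollback", 78),
]


def phase_durations(milestones):
    """Return (milestone, minutes since the previous milestone) pairs."""
    durations = []
    previous = 0
    for name, elapsed in milestones:
        durations.append((name, elapsed - previous))
        previous = elapsed
    return durations


for name, minutes in phase_durations(MILESTONES):
    print(f"{name}: +{minutes} min")
```

Under that assumption, no single step dominates the 78 minutes; the time is spread across declaration, team assembly, hypothesis elimination, and rollback.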
Building a Request Life Cycle to handle major incidents in ServiceDesk Plus
Now, let's see what could have happened if Zylker Airlines had implemented a streamlined incident management process with ServiceDesk Plus. Such a process would have contained the outage caused by the WAF rule update and minimized the disruption to customers far more effectively.
- ServiceDesk Plus can help Zylker follow a best practice framework (Fig. 1) to ensure the company firefights major incidents efficiently.
Fig. 1: Best practice framework
To implement this framework in real time, ServiceDesk Plus uses the Request Life Cycle (RLC) feature. RLCs let you design the complete life cycle of a ticket visually using a simple drag-and-drop canvas. An RLC breaks down the life cycle of a ticket into statuses and transitions. Every ticket in ServiceDesk Plus goes through various statuses, such as Open, On hold, Resolved, and Closed. With RLCs, you can design the sequence of the statuses along with the conditions and actions (transitions) required for every status change by simply dragging and dropping the statuses onto the canvas (Fig. 1.1).
Fig. 1.1: RLC drag-and-drop canvas
Transitions are actions required to move the ticket from one status to another. At each status, transitions guide technicians through conditional actions to advance to the next status. Technicians can change the status only through transitions available on the incident ticket's Details page (Fig. 1.2). There are three transition stages: BEFORE, DURING, and AFTER, which allow you to set numerous options to govern the status movement based on the fulfillment of specified conditions (Fig. 1.3). This RLC can be associated with one or more incident templates.
Fig. 1.2: Incident Details page
Fig. 1.3: Transition stages
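The idea of statuses connected by named transitions can be sketched as a small state machine. This is an illustrative Python model only, not ServiceDesk Plus configuration or its API: the `Transition` and `LifeCycle` classes are hypothetical, the starting status of the Report transition is assumed to be Open, and the Resolve and Close transitions are invented to complete the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    name: str    # label of the transition shown to the technician
    source: str  # status the ticket must currently be in
    target: str  # status the ticket moves to


@dataclass
class LifeCycle:
    transitions: list = field(default_factory=list)

    def add(self, name, source, target):
        self.transitions.append(Transition(name, source, target))
        return self

    def available(self, status):
        """Names of the transitions a technician could pick from this status."""
        return [t.name for t in self.transitions if t.source == status]

    def apply(self, status, name):
        """Return the new status, or refuse a move the life cycle does not define."""
        for t in self.transitions:
            if t.source == status and t.name == name:
                return t.target
        raise ValueError(f"No transition {name!r} from status {status!r}")


rlc = (
    LifeCycle()
    .add("Report", "Open", "WIP")             # source status assumed
    .add("Collaborate IRT", "WIP", "Triage")
    .add("Resolve", "Triage", "Resolved")     # hypothetical transition
    .add("Close", "Resolved", "Closed")       # hypothetical transition
)

print(rlc.available("Open"))
print(rlc.apply("Open", "Report"))
```

Because `apply` rejects any move that is not defined, a ticket in this model cannot jump from Open straight to Closed, which mirrors the rule that technicians change status only through the transitions offered on the ticket.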
Let's take a peek at how the RLC feature could have aided Zylker in containing the major incident.
- During the maintenance window, a WAF rule update is applied. ServiceDesk Plus integrates with multiple ITOM products, including ManageEngine OpManager, for monitoring networks and services. When OpManager identifies an anomaly in the services, an alert is automatically logged as an incident ticket in ServiceDesk Plus with all the relevant details, such as the date and time of the outage, affected systems or applications, and error messages received.
- This ticket automatically uses the major incident template through a simple no-code, rule-based automation. Once the ticket is logged with the major incident template, the major incident RLC associated with that template is instantly activated, and begins to guide the process.
- In the next three minutes, Zylker's service desk representative assesses the incident impact to avoid any false alarms and flags the incident ticket as a major incident by clicking the Report transition action on the incident ticket's Details page; this updates the ticket status and assigns the ticket to the incident response team (IRT) automatically. Then, an instant notification is sent to all the stakeholders. These actions are configured using the three transition stages in an RLC, explained below.
BEFORE action:
Zylker restricted the Report transition button to service desk representatives with specific roles and added conditions that determine whether the button is displayed on the incident Details page (Fig. 2). If the request type is an incident, the Report transition button is displayed on the incident Details page only for technicians with the IT or IRT Tech role.
Fig. 2: BEFORE action
DURING action:
While executing the Report transition, the Is it a major incident? field is mandatory; the service desk representative uses it to flag whether the incident is a major incident. If it is flagged as a major incident, the Group field on the incident Details page is updated to IRT (Fig. 3), which transfers the ticket to the IRT's bucket.
Fig. 3: DURING action
AFTER action:
After executing the Report transition, a custom notification is automatically sent to the IRT (Fig. 4) notifying them about the occurrence of a major incident. Apart from notifications, webhooks, tasks, and custom functions can also be triggered based on conditions. On execution of this transition, the Status of the incident ticket is moved to WIP.
Fig. 4: AFTER action
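The three stages of the Report transition described above can be expressed as plain logic. The following Python sketch is a hypothetical model, not how ServiceDesk Plus stores or executes an RLC: the `Ticket` class, the function names, and the notification text are assumptions, and sending the notification only when the incident is flagged as major is an interpretation of the description.

```python
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_ROLES = {"IT", "IRT Tech"}  # roles named in the BEFORE action


@dataclass
class Ticket:
    request_type: str
    status: str = "Open"
    group: str = ""
    is_major_incident: Optional[bool] = None
    notifications: list = field(default_factory=list)


def can_report(ticket, role):
    """BEFORE: offer the Report transition only for incidents and permitted roles."""
    return ticket.request_type == "Incident" and role in ALLOWED_ROLES


def report(ticket, role, is_major_incident):
    if not can_report(ticket, role):
        raise PermissionError("Report transition is not available to this role")

    # DURING: the 'Is it a major incident?' field must be filled in.
    if is_major_incident is None:
        raise ValueError("'Is it a major incident?' is mandatory")
    ticket.is_major_incident = is_major_incident
    if is_major_incident:
        ticket.group = "IRT"  # moves the ticket to the IRT's bucket

    # AFTER: notify the IRT (assumed: only for major incidents), then update status.
    if is_major_incident:
        ticket.notifications.append("To IRT: a major incident has been reported")
    ticket.status = "WIP"
    return ticket


ticket = report(Ticket(request_type="Incident"), role="IRT Tech", is_major_incident=True)
print(ticket.status, ticket.group, ticket.notifications)
```

Keeping the visibility check, the mandatory field, and the follow-up actions in separate steps makes it clear which stage rejects a transition and which stage produces side effects.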
- As a next step, a triaging process is undertaken. In the Collaborate IRT transition, Zylker configured a notification and added a custom function (Fig. 5) in the AFTER transition action, which allows ServiceDesk Plus to integrate with Microsoft Teams to create a virtual war room link. Clicking this transition on the incident Details page sends the custom notification, with the virtual war room link, to the distributed IRT, facilitating collaboration among teams working in a hybrid work model. Then, the Status of the ticket is automatically updated to Triage.
Fig. 5: Custom functions
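Conceptually, the custom function in the Collaborate IRT transition does two things: it obtains a meeting link and places that link in the notification. The Python sketch below illustrates only this flow. It does not use the product's custom function syntax or any real Microsoft Teams API; the meeting creator is passed in as a hypothetical callable, and the ticket ID, recipient address, and URL are placeholders.

```python
from typing import Callable, Dict, List


def collaborate_irt(
    ticket_id: str,
    recipients: List[str],
    create_meeting: Callable[[str], str],
) -> Dict:
    """AFTER action sketch: create a war room link, notify the IRT, move to Triage."""
    link = create_meeting(f"Major incident {ticket_id} war room")
    body = f"Major incident {ticket_id}: join the virtual war room at {link}"
    return {
        "status": "Triage",
        "notifications": [{"to": r, "body": body} for r in recipients],
    }


def fake_create_meeting(subject: str) -> str:
    # Stand-in for a real conferencing integration; returns a placeholder URL.
    return "https://example.invalid/war-room/" + subject.replace(" ", "-").lower()


result = collaborate_irt("INC-1042", ["irt@zylker.example"], fake_create_meeting)
print(result["status"])
print(result["notifications"][0]["body"])
```

Injecting the meeting creator keeps the notification logic testable without network access; a real integration would replace `fake_create_meeting` with a call to the conferencing service.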
- As triage occurs, customers are informed about the outage using the Announcement feature in ServiceDesk Plus to prevent them from flooding the service desk with new tickets.
- Now, the root cause analysis (RCA) begins. In the RCA Analysis transition (Fig. 6), tasks preassigned to a specific technician or technician group are added to the AFTER action so that the root cause can be analyzed.
Fig. 6: RCA Analysis transition
- In five minutes, the IRT identifies the root cause as the WAF rule update and notifies the appropriate stakeholders. Similarly, Zylker configured the three transition stages for the various statuses in the major incident life cycle (Fig. 7) as per its requirements.
Fig. 7: Zylker's major incident life cycle
- Once the root cause is found, the incident ticket is delegated to the responsible team to roll back the WAF rule update. One of Zylker's technicians rolls back the WAF rule update and restores services in 28 minutes. The team identifies the root cause and resolves the incident before it causes any significant disruption.
- Once the issue is resolved, stakeholders are notified, and the resolution is updated in the knowledge base to help technicians in the future.
Thus, RLCs within the framework of major incident management provide a structured approach to addressing critical incidents, safeguarding businesses against the potentially catastrophic consequences of any outages. Discover more about ServiceDesk Plus' RLC feature here.