Creating IT incident reports: A gateway to insights in the aftermath of IT incidents

November 26 | 09 mins read

In pursuit of institutional memory

Digital enterprises strive to deliver always-on service experiences to drive employee productivity and overall business revenue. However, despite the multi-pronged efforts of IT teams, minute blips can go unnoticed and snowball into unexpected IT incidents, such as cyberattacks and service outages.

Further, the absence of institutional memory prevents IT teams from retracing their steps and making course corrections while resolving IT incidents. Maintaining a central record of incident reports, with all details about each incident and its resolution, is crucial to streamline efforts, especially when, by one widely cited industry estimate, it takes 277 days on average to identify and contain a data breach. It's imperative to craft a well-documented IT incident report to help IT teams discern what has gone wrong, zero in on possible root causes, identify who's in charge, and ultimately resolve the incident effectively.

Getting familiar with IT incident reports

An IT incident report is a formal document that outlines the critical characteristics of IT incidents in a central location, ranging from finding out what happened and analyzing when and how it unfolded to the final resolution and the next steps. Consolidating these details in a central repository helps IT teams learn from past experiences and devise improvements, enhancing the reliability of their service operations.

Setting the foundation for employing IT incident reports

To understand why digital enterprises must harness IT incident reports, here's a quick glance at various scenarios and the utility of such reports therein.

Scenario | The utility
Reporting a data breach to regulatory authorities within a specific time limit | Ensures adherence to compliance requirements defined by regulatory laws
Poor coordination between the NOC, SOC, and incident response teams during a DDoS attack on an application due to a lack of IT visibility |
Hosting a CRM application on an outdated server linked to multiple IT incidents | Facilitates root cause analysis when the CRM application crashes
Sluggish performance of a media app due to a traffic surge during the holiday season | Helps understand trends and patterns to bolster remediation measures
Stakeholders unfamiliar with the IT incident management playbook | Serves as a useful source of reference for knowledge transfer and training

From the scenarios listed above, it's clear that documenting IT incidents through incident reports can deliver manifold benefits. Let's now dive into the specifics of creating an IT incident report.

Structuring an IT incident report

By clearly defining the structure of an IT incident report, enterprises can foster coherent documentation, enabling easy access to information. Here's how it could look.

[Figure: Incident report form template]
Fig. 1. A typical structure of an IT incident report.
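For teams that keep reports as structured records rather than free-form documents, the structure could be sketched roughly as below. This is only an illustration: the class and field names are hypothetical, and the fields are assumed to mirror the six report sections covered in this article (summary, detection and impact assessment, timeline, analysis, remediation actions, and takeaways), not the exact layout of the template in the figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineEntry:
    timestamp: str  # e.g. "2024-06-15 14:04"
    event: str


@dataclass
class IncidentReport:
    # Hypothetical record: one field per report section discussed in the article.
    summary: str
    detection: str
    impact_assessment: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    analysis: str = ""
    remediation_actions: List[str] = field(default_factory=list)
    takeaways: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]
```

A helper such as `missing_sections()` makes it easy to flag reports that were filed before the analysis or takeaways were written up.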

With an idea of how enterprises can structure their IT incident reports, let's now delve into how they can build one from scratch.

Crafting IT incident reports

Say Zylker, a fictional multinational FinTech company operating an online payment gateway, suffers an unexpected service outage. Let's see how it captured the nitty-gritty of this episode step by step across the various sections of an incident report:

1. Summary

This section contains a concise overview of the incident highlighting what happened, when and where it occurred, and the symptoms.

For example, users in the United States (US) were unable to access online payment services on June 15, 2024, from 2 pm to 4:30 pm due to an expired SSL certificate. During the incident, they encountered a security warning displaying a 525 error.

With this information, Zylker was better equipped to gauge the nature and scope of similar incidents.

2. Detection and impact assessment

Detection:

This section contains information about the source of the incident and the time taken to detect it. For instance, Zylker examined its monitoring logs, which revealed a spike in error rates related to SSL handshake failures at 2:04 pm. It also noted an uptick in calls on its customer support channels starting from 2:15 pm.

Impact assessment:

Further, the impact assessment section enumerates the impact of the IT incident on different users across geographies as well as the IT services, applications, and hardware affected therein.

In Zylker's scenario, payment processing and account operations were unavailable to customers and merchants in the US. Zylker's web servers could not serve HTTPS traffic, and dependent IT components, including the company's mobile app, its API gateways, and its merchant and e-commerce integrations, were also affected, resulting in failed transactions.

3. Timeline

A detailed sequence of events from detection to resolution along with their time stamps is a critical part of an incident report. This also includes the lead-up events, the actions of any stakeholders involved, and escalations. Here's how it looked in Zylker's case:

Date | Time | Event
June 15, 2024 | 2:00 pm | Online payment services became inaccessible to users served by the US data center.
June 15, 2024 | 2:04 pm | Monitoring tools detected a spike in SSL handshake error rates.
June 15, 2024 | 2:15 pm | Users reported the incident to Zylker's support team.
June 15, 2024 | 2:20 pm | Initial investigations by the IT operations team confirmed that the SSL certificate had expired.
June 15, 2024 | 2:30 pm | The incident was escalated to the network security team to expedite certificate renewal.
June 15, 2024 | 2:45 pm | A new SSL certificate was generated and tested in a staging environment.
June 15, 2024 | 4:30 pm | Deployment of the new SSL certificate was completed, restoring availability to the online payment services.

To effectively drive improvements in its incident response, Zylker examined the dependencies between various events to deduce potential triggers or root causes and identify existing gaps.
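One way to spot such gaps is to turn the timeline into simple response metrics. The sketch below is an illustration, not Zylker's actual tooling: the function names are hypothetical, the timestamps are taken from the timeline above in 24-hour form, and the event descriptions are shortened.

```python
from datetime import datetime

# Events from Zylker's timeline (local time; the report gives no time zone).
TIMELINE = [
    ("2024-06-15 14:00", "Online payment services became inaccessible"),
    ("2024-06-15 14:04", "Monitoring detected a spike in SSL handshake errors"),
    ("2024-06-15 14:15", "Users reported the incident to support"),
    ("2024-06-15 14:20", "IT operations confirmed the certificate had expired"),
    ("2024-06-15 14:30", "Escalated to the network security team"),
    ("2024-06-15 14:45", "New certificate generated and tested in staging"),
    ("2024-06-15 16:30", "New certificate deployed; service restored"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def gaps(timeline):
    """Pair each event (after the first) with the minutes since the previous event."""
    return [
        (later[1], minutes_between(earlier[0], later[0]))
        for earlier, later in zip(timeline, timeline[1:])
    ]


start = TIMELINE[0][0]
time_to_detect = minutes_between(start, TIMELINE[1][0])    # outage start to first alert
time_to_restore = minutes_between(start, TIMELINE[-1][0])  # outage start to restoration
longest_step = max(gaps(TIMELINE), key=lambda g: g[1])     # slowest hand-off in the chain
```

With these figures, detection took 4 minutes, while the longest single gap (105 minutes) sits between staging tests and production deployment, which points to the deployment step as the place to look for improvements.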

4. Analysis and investigations

This is the most important part of an incident report and lists out the underlying factors, from software bugs to hardware failures or human errors, that could have precipitated the IT incident.

In Zylker's case, its monitoring solutions detected a sudden spike in SSL handshake error rates. Because all affected users reported the same error message, the company ruled out client-side causes. To pinpoint the exact cause, Zylker reviewed server configurations, such as cipher suites, and checked the certificate's validity. The latter revealed that its SSL certificate had expired and had not been renewed.
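A certificate validity check of this kind can be sketched with Python's standard `ssl` module. This is a minimal illustration under stated assumptions: the function names are hypothetical, the article does not give the certificate's exact expiry time, so any `notAfter` value passed in is an example, and `fetch_not_after` needs network access to a real host.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate's notAfter string, e.g. 'Jun 15 18:00:00 2024 GMT', to UTC."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def is_expired(not_after: str, now: datetime) -> bool:
    """True if the certificate's validity period has ended at the given moment."""
    return now >= parse_not_after(not_after)


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read notAfter from a live server.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError during the handshake,
    which is itself a useful diagnostic.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Keeping the parsing and comparison separate from the network call means the expiry logic can be run against values stored in a CMDB or inventory, without contacting the server.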

By exploring past incidents and the causative factors, Zylker discovered systemic vulnerabilities, including manual certificate management, an outdated CMDB, and a lack of backup certificates. Now, the company can plan corrective measures to prevent such recurrences.

5. Remediation actions

After documenting the root cause, it's also important to record the mitigation and troubleshooting activities undertaken to restore normal operational levels.

For instance, Zylker generated a new SSL certificate and deployed it in a staging environment to ensure compatibility with its IT landscape. After deploying the certificate to its production servers, Zylker examined internal and external access to services, helping it validate the establishment of secure connections.

To overcome bottlenecks encountered during incident remediation, Zylker leveraged Infrastructure as Code practices and testing tools that could simulate different load conditions, ensuring seamless staging and testing for SSL certificates.

6. Takeaways

After detailing remediation actions, the incident report should also give an account of which actions succeeded and which failed. Further, it should capture suggested improvements, ranging from automating operations to training IT talent.

To illustrate, here's what Zylker planned to implement to prevent such recurrences:

  • Sending timely notifications to relevant stakeholders 30, 14, and 7 days before certificate expiry
  • Automating the certificate renewal process
  • Setting up backup mechanisms to retrieve essential information if renewal fails
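The notification schedule in the first item could be computed along these lines. This is a sketch only: the function names are hypothetical, the expiry date passed in is an example, and actually sending the notifications (email, chat, ticket) is left out.

```python
from datetime import date, timedelta

# Reminder offsets taken from Zylker's takeaways: 30, 14, and 7 days before expiry.
REMINDER_OFFSETS_DAYS = (30, 14, 7)


def reminder_dates(expiry: date, offsets=REMINDER_OFFSETS_DAYS):
    """Dates on which stakeholders should be notified, earliest first."""
    return sorted(expiry - timedelta(days=d) for d in offsets)


def reminder_due(expiry: date, today: date, offsets=REMINDER_OFFSETS_DAYS) -> bool:
    """True if today is exactly one of the reminder days for this certificate."""
    return (expiry - today).days in offsets
```

A daily scheduled job could call `reminder_due()` for every certificate in the inventory and notify the owners of those that match.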

By documenting preventive and corrective actions, Zylker promoted a culture of learning and embedded best practices in its incident management strategy, enabling seamless adaptation to the ever-changing tech landscape.

Thus, with IT incident reports, digital enterprises like Zylker can arm their IT teams with a trove of insights, from patterns to preventive measures.

Building a solid IT incident report with ServiceDesk Plus

With ServiceDesk Plus, IT teams can collate crucial information from detection to resolution in a single window. They can gather extensive information with customizable incident templates while tracking details from monitoring tools within a ticket. By accessing configuration items from within the ticket, they can effectively assess the impact of an incident. With this context, they can zero in on the root cause with problem management. Further, they can trace the timeline of events from the history of operations carried out. Finally, they can keep tabs on the various resolution attempts, ensuring future efforts are well-guided.

By consolidating details across an incident's life cycle, ServiceDesk Plus arms IT teams with accurate information at their fingertips, facilitating the creation of rock-solid IT incident reports. To understand how ServiceDesk Plus can help you stay ahead of the curve, request a personalized demo.

About the author

Nisha Ravi

Nisha Ravi is an ITSM enthusiast who is keen on learning service management best practices and the latest tech advancements. As a ManageEngine ServiceDesk Plus product expert, Nisha works on developing articles and blogs that help IT service delivery teams address specific IT and IT service management challenges. A regular presenter at the ServiceDesk Plus Masterclass series, she delivers intense, hands-on product training sessions to ManageEngine customers. She also presents at the ManageEngine ITCON seminars, promoting ITSM best practices for IT practitioners across the globe.
