Incident management handbook: How Zoho manages the spectrum of IT incidents
Whew! That was a close call. Let's hope that never happens again!
Over the last decade, we’ve combated thousands of incidents.
As bootstrappers, we’ve experienced low-impact incidents that typically needed fewer technicians but still required a well-established incident management (IM) framework. For most incidents in the past, we relied on problem solving by individuals. However, as our IT infrastructure grew, we faced more complex and high-impact incidents, forcing us to up our IM game.
Soon, we realized there is no one-size-fits-all process for managing all the different types of incidents our organization faced. So, we took the frameworks that were most effective and added, combined, or omitted steps to handle each type of incident based on its impact and our business operations. This ensures each response is tailored to the challenges the incident presents.
The result? Our incident process now extends beyond established industry frameworks. We classify our IM frameworks by the severity of each type of incident and its impact on business operations.
| IM framework | Impact | Scenarios |
| Desktop sprint | Break/fix incidents that affect an individual user | |
| | Low or medium-impact incidents that affect user groups or departments | |
| Big bang | | |
| CyberSec (Showstoppers) | | |
When it comes to IM, there is no one-size-fits-all solution as every organization is different. What will work for your organization will depend on your business model, infrastructure, operations, the information you are protecting, your resources, and more. Recognize that some techniques only come with time and experience. This should not, however, discourage you from getting started!
Who is this guide for?
This e-book is written for IT leaders, managers, and practitioners from a service management perspective. We will walk you through our IM processes with illustrated process flows, roles, and best practices. This guide is full of lessons we've learned through trial and error—so you don't have to.
Before we dive in, let's get the basics out of the way.
What is an incident?
An incident is an event that causes, or may cause, an unplanned interruption to an IT service or a reduction in its quality. Some classic examples are the internet running too slow, a business application going down, or a printer not working.
Truth is, we can define an incident in many ways. What matters most is that every incident should have a well-structured, timely response and resolution.
What is incident management?
Incident management is the process of restoring normal service operations as quickly as possible while minimizing any adverse impact on business operations or the user.
Our incident values
| Incident principles | Approach |
| Be proactive, not reactive | |
| Be open, and communicate | |
| Align teams, collaborate effectively | |
| Bounce back quickly | |
| Document the lessons | |
| Continually improve | |
Our IM tools
We utilize several tools to aid our IM processes.
Desktop incidents
Track & manage incidents:
ServiceDesk Plus Cloud is customized to fit our incident management processes.
Password resets:
ADSelfService Plus is a self-service password reset tool.
Password management:
Password Manager Pro is a secure vault for storing and managing shared sensitive information such as passwords, documents, and digital identities of enterprises.
Endpoint management:
Endpoint Central, a unified endpoint management solution, helps manage servers, laptops, desktops, smartphones, and tablets from a central location.
Major availability incidents
Alerting tool:
We use Site24x7 to monitor the availability of servers and applications.
Security incidents
Bug Bounty program:
Bug Bounty is a third-party tool that lets employees and individuals report bugs, such as exploits and vulnerabilities.
Communication
Note:
We also use social networking sites, messaging platforms like WhatsApp, and phone calls as backup channels should Zoho Cliq, our chat tool, go down. It's important to have alternative means of communication during a disaster.
Documentation:
Zoho Docs is a central system for storing all incident and root cause analysis (RCA) documents.
Chat:
Zoho Cliq is a real-time business messaging app that helps our employees communicate effectively with each other anytime, including during an incident.
Collaborate:
Zoho Connect is collaboration software that ensures that all teams can be on the same page when resolving incidents. Some call it the Facebook of our workplace.
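The note above about keeping alternative channels ready can be pictured as a simple ordered fallback: try the primary channel first, and move to the next one only if delivery fails. The sketch below is an illustration only, not a description of how any of the tools listed here are actually wired together. The channel names, the `notify_with_fallback` helper, and the `send_via_*` functions are all hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence, Tuple

# A channel is a (name, send) pair; send returns True when delivery succeeded.
Channel = Tuple[str, Callable[[str], bool]]


def notify_with_fallback(message: str, channels: Sequence[Channel]) -> Optional[str]:
    """Try each channel in order and return the name of the first that delivers."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat an exception the same as a failed delivery and move on.
            continue
    return None  # no channel could deliver the message


# Hypothetical stand-ins for real integrations (assumptions, not real APIs).
def send_via_chat(message: str) -> bool:
    raise ConnectionError("chat service unreachable")  # simulate an outage


def send_via_messaging_app(message: str) -> bool:
    return True  # simulate a successful delivery


def send_via_phone_call(message: str) -> bool:
    return True


channels = [
    ("chat", send_via_chat),
    ("messaging app", send_via_messaging_app),
    ("phone call", send_via_phone_call),
]

used = notify_with_fallback("Major incident declared", channels)
print(used)
```

Because the simulated chat channel fails, the helper reports that the messaging app delivered the message. Keeping the order in a plain list makes the fallback sequence easy to review and rehearse before a real outage.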
Our incident management command center (IMCC)
Our incident management command center (IMCC) is a large, secure room with big, NASA-style monitoring screens that provide detailed metrics and visibility, enabling our IM teams to react quickly and troubleshoot effectively during incidents. The room hosts three core teams: the network operations center (NOC) team, the Zorro team, and the central system admin team. We also have dynamic access control at other work sites so that monitoring activities can be performed there.