A BCDR blueprint for enterprises
Zoho’s BCDR framework
Purpose and Scope
Purpose
The BCDR plan lays out the steps and procedures that Zoho and ManageEngine will follow before, during, and in the wake of a disaster (e.g., natural disasters, man-made events, pandemics) so that we remain resilient as a company, maintain maximum functionality during the emergency, and return our operations to normal in the shortest possible time.
Key elements of the BCDR:
- Resilience: withstanding business interruptions in the face of adverse conditions
- Recovery: getting back to business as quickly as possible after a disaster
- Contingency: having a comprehensive set of measures and controls in place for a full recovery
- Continual improvement: continually reviewing the plan to make necessary revisions and keep it updated
Scope
The effectiveness of a BCDR depends on a well-defined scope. As Zoho is a large enterprise with distributed teams, scoping the BCDR is understandably complex. There are many questions that we ask, answer, and record when scoping the BCDR:
- Is it intended to cover all work sites, disaster prone sites, or the production center?
- Is it to cover all customers, or just a percentage of them?
- Is it intended to cover a local disaster, or widespread disasters such as hurricanes and pandemics?
- What are our essential products and services?
- What are the critical processes and business units that MUST function in the event of a disaster? Example: customer-facing teams
Next, we validate certain assumptions. For example, that skilled resources, team leaders, or their alternates will be available following a disaster.
BCDR Governance
For many organizations tasked with BCDR, the first instinct is to start writing a plan immediately. However, experience tells us that a good governance structure is key to steering our BCM efforts and ensuring that there are no dead ends or pitfalls in the processes.
We have a governance body, the BCDRC, which comprises the board of directors and senior management executives of Zoho and ManageEngine. The BCDRC is brought on board early to steer our BCDR efforts and to ensure that a) the right individuals are in the right roles to maximize our business continuity efforts, and b) the BCDR is kept ready and relevant at all times.
The following table highlights the roles and responsibilities of our BCDRC.
The roles and responsibilities of our BCDRC
Board of directors | Senior management |
Understand and communicate the value of BCM and the risks in the absence of BCM | Have a sound working knowledge of BCM practices and business risks |
Review the organization's BCDR annually | Keep the board of directors and C-suite executives informed of any significant changes to the business continuity plans |
Get frequent updates from the senior management team on any new business continuity policies and procedures | Define Zoho's business management objectives, provide strategic inputs for BCDR, and designate the BCC |
Direct and approve the planning, implementation, testing, and other strategic objectives of the BCDR | Review and approve the critical processes, standard operating procedures (SOPs), and planning exercises of our BCDR for each business unit whenever they are created or updated |
Direct the audit committee to prepare for external audits | Support and communicate the importance of BCM planning, training, and testing to all stakeholders |
Direct the external communication plan to investors, customers, media, and law enforcement authorities | Assign the right middle managers to perform key BCM-related procedures and exercises |
Other roles and responsibilities
Who? | Does what? |
Middle management (Business owners) | |
Internal - Audit and risk management committee (ARMC) | |
Risk management team | |
IT incident management team | |
Business continuity coordinators (BCC) are similar to incident coordinators (refer to the IM process). The BCC create and maintain the BCDR and work closely with other critical business functions to understand their processes, identify risks, and help manage and minimize those risks. | |
Legal counsel | TBD |
Other stakeholders | |
Risk Assessment
Risk Assessment Methodology
The first and key step of BCM is the assessment of risks. Risk is uncertainty about achieving objectives that can affect our business adversely. Risks are realized when:
- The objectives of the business are not achieved.
- There is non-compliance with the organization's policies and procedures or with external legislation and regulation.
- The resources of the business are not utilized in an efficient and effective manner.
- There is a violation of the Confidentiality, Integrity and Availability (CIA) of information.
It is important for Zoho to have an all-hazards approach to risk assessment and control processes in place to ensure that potential impacts do not become real, or if they do, there is a contingency plan in place to deal with them. It is also important that the process is sufficiently clear so that successive assessments produce consistent, valid, and comparable results, even when carried out by different people.
Establish the context
The scope of risk assessment is defined based on factors such as:
- Geographical location: Distributed data centers and office set up
- Business units or departments
- Business process(es)
- IT services, systems, and networks
- Customers, partners, products, or services
The overall environment in which the risk assessment is carried out should be identified and rationalized. This will include a description of the internal and external context and any recent changes that affect the likelihood and impact of risks in general.
Internal Context | External Context |
Governance, organizational structure, roles and accountabilities | The cultural, social, political, legal and regulatory environment |
Policies, objectives, and strategies | Financial, technological, economic, natural and competitive environments |
Capital, time, people, processes, systems and technologies | International, national, regional or local environments |
Information systems, information flows and decision-making processes | Key drivers and trends that have an impact on the objectives of the organization |
Relationships with, and perceptions and values of, internal stakeholders | Relationships with, and perceptions and values of, external stakeholders |
The organization’s culture | |
Standards, guidelines and models adopted by the organization | |
Form and extent of contractual relationships | |
The type(s) of cloud services provided | |
Risk Identification
Although there are myriad disasters, the resulting effects are similar for most, and it is these we plan for. They result in scenarios such as loss of infrastructure or sustained IT failure. Preparing for the worst-case scenario helps cover many scenarios and risks in a single plan.
Our risk assessment team identifies, classifies, and assesses a wide range of disasters, especially those with catastrophically high impact potential, then characterizes their effects on business to enhance preparedness, response, and resilience.
Natural (category) | Natural (sub category) | Willful | Accidental |
Geophysical | Earthquake | Bomb threat | Chemical spill |
Geophysical | Volcanic eruption | Terrorist activity | Radiation contamination |
Geophysical | Landslide | Civil disorder | Heating systems or air conditioning failure |
Geophysical | Rockfall | Bomb explosion | Telecommunication failure |
Meteorological | Thunderstorm | Bio weapons | Network failure |
Meteorological | Lightning snowstorm | Disastrous waste | Gas leak |
Meteorological | Blizzard | Employee strike | Fire (internal) |
Meteorological | Tornado | Cyber attack | Wildfire (external) |
Hydrological | Flood | Disgruntled employees sabotage an organization's systems | |
Hydrological | Tsunami | | |
Hydrological | Avalanche | | |
Climatological | Drought | | |
Climatological | Heatwave/coldwave | | |
Climatological | Forest/land fires | | |
Biological | Epidemic | | |
Biological | Pandemic | | |
Most enterprises have resilience plans for geophysical, willful, and accidental disasters and for IT disaster recovery. These plans, while effective for various business disruptions, can fall short during a global pandemic like COVID-19.
It's important that enterprises understand the significant differences between natural disasters and pandemic outbreaks so they can look beyond traditional business continuity strategies. At Zoho, we have established pandemic-specific policies and communication strategies to minimize business disruptions.
While natural disasters with physical phenomena are limited to a particular geography, biological ones such as viral pandemics spread globally. The table below lists the differences between the disruptions due to natural disasters and pandemics.
Disruption differences between natural and biological disasters
Distinguishing factors | Natural | Biological |
Impact | Affects the organization, facility, workforce and third parties. | A systemic event that affects everyone globally, including the organization and its workforce, customers, suppliers, and competitors. |
Exposure | Can be contained and isolated as soon as the root cause is identified. | A contagion that spreads rapidly across geographies with severe impacts. |
Duration | Shorter duration that varies from a few hours to a week. | Longer duration, with a viral pandemic lasting for several months. |
Workforce | Temporary shortage and relocation of workforce. | Significant shortage of workforce that needs other alternatives like telecommuting. |
External communication | Emergencies should be reported to the appropriate law enforcement authorities and health care assistance (e.g. police, fire station, ambulance). | High degree of coordination with the local government, law enforcement, and health care assistance. |
Infrastructure | Affects public infrastructure availability like electricity, telecommunications, and internet. | Affects the global supply chain. |
Compile/maintain asset inventory
The definition of an asset is taken to be "anything that has value to the organization" and needs to be protected. A full inventory of assets is compiled and maintained by Zoho using the ServiceDesk Plus application. This includes customer data that Zoho stores and processes in its role as a cloud service provider.
Two major types of assets are identified as:
- Primary assets — information and business processes and activities
- Supporting assets — hardware, software, network, personnel, site, organization structure
The list of assets is held in the document "Information Asset Inventory" and in the ServiceDesk Plus application. Within the inventory, every asset is assigned a value, which should be considered as part of the impact assessment stage of this process. Each asset also has an owner who should be involved in the risk assessment for that asset. Where appropriate for the purposes of risk assessment, cloud customer data assets may be owned by an internal role, with the customer consulted regarding the value of those assets.
For the purposes of risk assessment, it is recommended to group assets with similar requirements together so that the number of risks to be assessed remains manageable.
For each asset (or asset group), the threats that could be reasonably expected to apply to it will be identified. These will vary according to the type of asset and could be accidental events such as fire, flood or vehicle impact or malicious attacks such as viruses, theft or sabotage. Threats will apply to one or more of the Confidentiality, Integrity, and Availability of the asset.
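The grouping step described above can be illustrated with a minimal sketch. The asset names, types, and Confidentiality/Integrity/Availability labels below are hypothetical and are not taken from Zoho's inventory; the sketch only assumes that assets sharing a type and the same CIA requirements can be assessed together.

```python
from collections import defaultdict

# Hypothetical asset records: (name, asset type, (C, I, A) requirement labels)
assets = [
    ("Customer database", "information", ("high", "high", "high")),
    ("Billing database", "information", ("high", "high", "high")),
    ("Office laptops", "hardware", ("medium", "low", "medium")),
    ("Support desktops", "hardware", ("medium", "low", "medium")),
    ("Core switch", "network", ("low", "medium", "high")),
]


def group_assets(records):
    """Group assets that share a type and identical C/I/A requirements."""
    groups = defaultdict(list)
    for name, asset_type, cia in records:
        groups[(asset_type, cia)].append(name)
    return dict(groups)


groups = group_assets(assets)
for (asset_type, cia), names in groups.items():
    print(asset_type, cia, names)
```

Here five assets collapse into three groups, so threats are identified once per group rather than once per asset.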
Risk scenarios
The identification of risk scenarios is performed by a combination of group discussion and interviews with interested parties such as:
- Business unit manager(s) responsible for each business-critical activity
- Representatives of the people who normally conduct each aspect of the activity
- Providers of the inputs to the activity
- Recipients of the outputs of the activity
- Appropriate third parties with relevant knowledge
- Representatives of those providing supporting services and resources to the activity
- Any other party that is felt to provide useful input to the risk identification process
The identified risks along with a description are recorded to assess the likelihood and impact of the risks.
Disasters and risk scenarios identified in the last decade
Hazards | Risk Scenarios |
Earthquakes | Irreversible damage to IT infrastructure |
Floods | Half of core revenue generating business units |
Tsunami | |
Pandemic | Loss of Zoho production center (Zoho Estancia building) & data centers |
Ransomware | Loss of critical customer data |
DDoS attacks | Absenteeism of critical employees |
Telecommunication failure | Loss of access to our work sites |
Network failure | Interruption of supply chain |
Risk analysis
This process involves assigning a numerical value to the a) likelihood and b) impact of a disaster. These values are then multiplied to arrive at a classification level of high, medium, or low for the disaster.
Assessing the likelihood
An estimate of the likelihood of a disaster occurring is made. This should take into account whether the disaster has occurred before, either to Zoho or to similar organizations or locations, and whether there exists sufficient motive, opportunity, and capability for a threat to be realized.
The likelihood of each disaster is graded on a numerical scale of 0 (never occurred) to 3 (high). General guidance for the meaning of each grade is given in the table below. When assessing the likelihood of a disaster, existing controls are taken into account, which means an assessment has to be made of their effectiveness. The rationale for the grade assigned to a disaster risk is recorded to aid understanding and to allow the assessment to be repeated in the future.
LIKELIHOOD |
PROBABILITY | EXPLANATION | SCORE |
LOW | An event that has never occurred | 0 |
LOW | An event that is highly unlikely to occur or occurs rarely (perhaps once in 3 years) | 1 |
MEDIUM | An event likely to occur relatively infrequently, perhaps once a year | 2 |
HIGH | An event that is fairly probable, and could be expected to occur several times a year | 3 |
Assessing the impact
An estimate is made of the impact that the disaster risk could have on the organization, including on the Confidentiality, Integrity, or Availability of its information. This will take into account existing controls that lessen the impact, as long as these controls are seen to be effective. Consideration will be given to the impact on the following:
- Customers
- Finance
- Health and Safety
- Reputation
- Knock-on impact within the organization
- Legal, contractual or organizational obligations
The impact of each risk is graded on a numerical scale of 0 (no impact) to 3 (high).
IMPACT | EXPLANATION | SCORE |
LOW | No impact | 0 |
LOW | Negligible impact that takes little effort to repair; damage to reputation or revenue loss is minimal | 1 |
MEDIUM | Tangible harm that takes extra effort to repair; damage to reputation or revenue loss is significant | 2 |
HIGH | Significant expenditure of resources required and compromise of the system; damage to reputation and revenue loss is high | 3 |
Risk classification
Based on the assessment of the grade of likelihood and impact, a score is calculated for each risk by multiplying the two numbers. This resulting score is then used to decide the classification of the risk based on the matrix.
Risk formula: Risk score = Likelihood × Impact
Each risk will be allocated a classification based on its score as follows:
RISK VALUE | RISK LEVEL | COLOR CODING |
0-3 | LOW | Green |
4-6 | MEDIUM | Orange |
7-9 | HIGH | Red |
Note: Based on our risk appetite, we do change the definition of high, medium, and low classifications. For example: We may decide that only risks with a score of 16 or more
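The scoring and classification steps in this section can be sketched in a few lines. This is a minimal illustration, assuming the 0 to 3 grades shown in the likelihood and impact tables and the 0-3 / 4-6 / 7-9 bands from the risk value table; the function names are illustrative and not part of any Zoho tooling.

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Multiply the likelihood grade by the impact grade (each 0 to 3)."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if value not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be 0, 1, 2, or 3, got {value!r}")
    return likelihood * impact


def risk_level(score: int) -> str:
    """Map a score to the LOW / MEDIUM / HIGH bands of the risk value table."""
    if not 0 <= score <= 9:
        raise ValueError(f"score must be between 0 and 9, got {score!r}")
    if score <= 3:
        return "LOW"
    if score <= 6:
        return "MEDIUM"
    return "HIGH"


# Example: a disaster graded likelihood 2 (MEDIUM) and impact 3 (HIGH)
score = risk_score(2, 3)
print(score, risk_level(score))
```

Keeping the bands in one function means that a change in risk appetite only requires the thresholds to be adjusted in a single place.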
Risk evaluation
Risk acceptance criteria
Risk treatment is not required for risks ranked in the “Low” risk level: if the risk score is 3 or lower, no action is taken, and if it is 4 or higher, treatment actions are initiated. Risk treatment can still be carried out for the “Low” risk category, should the BCDRC decide to do so.
We evaluate risks to decide which can be accepted and which need to be treated, taking the risk acceptance criteria into account. The matrix above shows the classifications of risks, where green indicates that the risk is below the acceptable threshold and can be regarded as “safe”. The orange and red areas generally indicate that a risk does not meet the acceptance criteria and needs to be treated. Risks are prioritized for treatment according to their score and classification, so that high-scoring risks are addressed before those with lower levels of exposure for the organization.
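As a rough sketch of the evaluation step, the snippet below splits a register into risks to treat and risks to accept, then orders the former by score. The threshold of 4 follows the acceptance criteria above; the register entries and their grades are hypothetical examples, not assessed values.

```python
TREATMENT_THRESHOLD = 4  # scores of 4 or more are treated

# Hypothetical register entries: (risk scenario, likelihood grade, impact grade)
register = [
    ("Loss of access to work sites", 2, 2),
    ("Loss of critical customer data", 1, 3),
    ("Sustained network failure", 3, 3),
    ("Interruption of supply chain", 2, 3),
]


def evaluate(entries, threshold=TREATMENT_THRESHOLD):
    """Return (risks to treat, highest score first) and (accepted risks)."""
    scored = [(name, likelihood * impact) for name, likelihood, impact in entries]
    to_treat = sorted(
        (risk for risk in scored if risk[1] >= threshold),
        key=lambda risk: risk[1],
        reverse=True,
    )
    accepted = [risk for risk in scored if risk[1] < threshold]
    return to_treat, accepted


to_treat, accepted = evaluate(register)
print("Treat:", to_treat)
print("Accept:", accepted)
```

Passing the threshold as a parameter leaves room for the BCDRC to decide that a low-scoring risk should be treated anyway.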
Risk assessment report
The results derived from risk evaluation are captured in the risk assessment report with the following information:
- Assets (asset-based risk assessment only)
- Threats
- Vulnerabilities
- Risk scenario descriptions (scenario-based risk assessment only)
- Controls currently implemented
- Likelihood (including rationale)
- Impact (including rationale)
- Risk score
- Risk classification
- Risk owner
- Whether the risk is recommended for acceptance or treatment
- Priority of risks for treatment
Note: The risk assessment report holds the inputs to the risk treatment stage of the process and is signed off by the BCDRC before proceeding further, particularly those risks that are recommended for acceptance.
Risk treatment
Risk treatment is a process to develop a range of options for mitigating the risks that are agreed to be unacceptable. We apply the following measures to treat the risks:
- Modify the risk by applying appropriate controls to lessen the likelihood and/or impact of the risk.
- Avoid the risk by taking actions that mean it no longer applies.
- Share the risk with another party. For example: insurer or supplier.
We use our judgement to decide which course of action to follow based on a sound knowledge of the circumstances surrounding the risk. Example: Business strategy, regulatory and legislative considerations, technical issues, commercial and contractual issues.
Note: The risk reviewer ensures that all parties who have an interest or bearing on the treatment of the risk are consulted, including the risk owner.
Risk treatment plan
On evaluating the treatment options, the risk treatment plan is created with the below details:
- Risks requiring treatment
- Risk owner
- Recommended treatment option
- Control(s) to be implemented
- Responsibility for the identified actions
- Timescales for actions
- Residual risk levels after the controls have been implemented.
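One way to picture a treatment plan entry is as a small record that carries both the original and the residual grades, so the residual risk level can be checked against the acceptance threshold. This is only a sketch: the class, its field names, and the sample values are assumptions made for illustration, and the threshold of 4 mirrors the risk acceptance criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TreatmentPlanEntry:
    risk: str
    owner: str
    treatment: str            # "modify", "avoid", or "share"
    controls: List[str]
    likelihood: int           # grade before treatment (0 to 3)
    impact: int               # grade before treatment (0 to 3)
    residual_likelihood: int  # expected grade once controls are in place
    residual_impact: int      # expected grade once controls are in place

    @property
    def initial_score(self) -> int:
        return self.likelihood * self.impact

    @property
    def residual_score(self) -> int:
        return self.residual_likelihood * self.residual_impact

    def residual_acceptable(self, threshold: int = 4) -> bool:
        """True when the residual score falls below the treatment threshold."""
        return self.residual_score < threshold


# Hypothetical entry for illustration only
entry = TreatmentPlanEntry(
    risk="Loss of critical customer data",
    owner="Risk owner (hypothetical)",
    treatment="modify",
    controls=["Offsite backups"],
    likelihood=2,
    impact=3,
    residual_likelihood=1,
    residual_impact=3,
)
print(entry.initial_score, entry.residual_score, entry.residual_acceptable())
```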
Statement of Applicability (SOA)
The SOA sets out those standard controls that have been selected and the reasons for their selection. It also details those that have been implemented and identifies any that have been explicitly excluded, along with the reasons for exclusion.
BCDRC Approval
At each stage of the risk assessment process, the BCDRC is kept informed of the progress and the decisions taken, including the formal sign-off of the proposed residual risks. The BCDRC approves the following documents:
- Risk assessment report
- Risk treatment plan
- Statement of Applicability (SOA)
The acceptance or treatment of each risk will be signed off by the relevant risk owner.
Risk Monitoring and Reporting
As part of the implementation of new controls and the maintenance of existing ones, key performance indicators (KPIs) are identified that allow us to measure how well the controls address the relevant risks. These indicators are reported on a regular basis, and trend information is produced so that exceptional situations are identified and dealt with as part of the BCDRC review process.
Regular Reviews
In addition to a full annual review by ARMC, risk assessments are evaluated on a regular basis to ensure that they remain current and the applied controls are valid and relevant. The relevant risk assessments are also reviewed upon major changes to the business such as office moves, mergers and acquisitions or introduction of new or changed IT services.
Business Impact Analysis
While some business functions may be relatively unimportant, others are absolutely critical to ongoing business. The BIA process makes it easy to pinpoint the most critical business functions, their interdependencies, and whether they should be considered for inclusion in the business continuity strategy. It also helps us identify how these core functions can be impacted by disasters, and lays the groundwork for more systematic and logical recovery plans.
Furthermore, this analysis makes us more confident and secure about our business decisions, knowing full well that they are based on a solid understanding of the most essential components of our business.
The core objectives of BIA are as follows:
- Prioritize business-critical units or departments, products, and services that must be protected
- Create an inventory of essential business activities and the minimum resources required to conduct business as usual, or close to it.
- Establish recovery time frames or recovery time objectives (RTOs) to help prioritize risk treatment plans and select the appropriate response and recovery strategies.
As shown in the process activity diagram below, BIA is a multi-phase process performed by BCC.
BIA Interviews
The BCC take stock of all business units and gather some basic information before the actual interview using a Zoho Creator form. A link to the questionnaire is sent out by email in the name of a department head, along with a note on what the BCC are trying to accomplish through this exercise and why it's important. A reasonable amount of time (around 2 weeks) is given to the concerned teams to complete the task. This prework sets the stage for more focused and effective BIA interviews and also shortens them.
The BCC initially ask the below questions:
- What is the name of the business unit?
- What does the business unit do?
- How many resources does the business unit have?
- Where is the business unit located?
- What are the hours of operation? Does it involve shifts?
Tip: Choose a data gathering model that is the least time-consuming and the most aligned with how you work in your organization. Efforts that are not part of your mainstream business activities, such as business continuity, disaster recovery, and compliance, are usually low priority for your business units, so any steps that you take to reduce the effort of gathering the data can pay off.
Gather information
The BCC hold a kick off meeting to hand out the questionnaire to the department heads and to clearly articulate the purpose of the whole exercise. The questionnaire covers all the required data points as the final output of the BIA relies on this step.
Below is a sample questionnaire from the BCC:
Sample questionnaire from the BCC
Data points | Questions | IT related questions |
Business unit and processes | Describe your business unit and its processes. | What IT systems and applications does this business unit use? |
Dependencies | What are your dependencies with other business units? Would a disruption of this business unit impact others? How and when would this disruption to other units happen? | What are the IT systems that impact or are impacted by this business unit? |
Resource dependencies | Does this business unit depend on any key job functions? If yes, then what is the job function and to what extent does this business unit depend on the job function? What is the minimum number of resources needed for this business unit to function? | What are the secondary systems (if any) needed for these job functions? |
Expertise dependencies | Does this business unit depend on the knowledge and expertise of a skilled worker? If yes, describe the role and expertise of the skilled worker and the impact on business in their absence. | |
Operational | If this business unit did not function, how would it impact business? | If this business unit did not function, how would it affect IT operations? |
Tolerance to outages | In the face of a disaster, such as loss of the production center (Zoho Estancia), how long can the business unit/systems sustain before the loss impacts the organization, its stakeholders, and suppliers? | |
Minimum infrastructure requirements | What are the infrastructure requirements for your business unit: physical space, office supplies, network, communication, furniture, lighting, HVAC, water, and food supplies? | |
Others | Are there any other concerns that can affect the recovery of your business unit? | |
Alternate business processes and resources | What workarounds are currently in place for your business processes? Who are the alternate or backup resources? | |
Critical documentation | Where do you store your critical documents? Mention the type of documents, location, and alternate locations (if any). | |
Recovery timeframes | What are the potential recovery issues that your business unit can face? What's the minimum recovery time frame? Who are the essential resources needed to restore operations to a near-normal state? | |
Financial impact | If this business unit did not function, what would be the financial impact on business? When would the impact be realized? Will it be a one-off impact or recurring? | |
Recovery time frame | What is the minimum time frame (in hours, days, weeks, months) to recover this business unit? | How long would it take to recover or replace the IT systems/applications related to this business unit? |
Service level agreements | Are there any service level agreements in place for this business unit? In the event of a disaster, what would be the impact on SLAs? What are the key metrics associated with the SLAs? | How would IT service levels be impacted during a disruption of this business unit? |
IT applications | What software applications are needed for this business unit? | What IT assets are needed to run these applications and to support this business unit? |
Desktops, laptops, workstations | How many desktops, laptops, workstations are needed for this business unit? | What is the configuration data for these systems? |
Servers and networks | Does this business unit require backend systems and network? | |
Workarounds | Does this business unit have any workaround processes that have been developed and tested? If yes, would these processes facilitate the smooth functioning of this business unit during an event? If no, is it feasible to develop such workarounds? | Are there any IT-related workarounds for this business unit? If yes, what are those workarounds and how can they be implemented? |
Remote | Will this business unit be able to work from backup recovery sites of Zoho, or work remotely from home? | What should IT do to enable remote access for this business unit? |
Vital records | Where does this business unit store critical documents? Are these documents backed up? If yes, where and how frequently does the business unit back up documents? | Where are the document backups stored? Is the current document backup strategy sound enough? |
Previous business disruption experience | Has this business unit faced any disruptions earlier? If yes, what was the disruption scenario and duration? Any learnings that can be incorporated into the BCDR to prepare for future disruptions? | Has IT been involved in this disruption scenario? If yes, how did IT address this disruption? |
Competitive impact | What would be the competitive impact to Zoho if this business unit faced significant disruption? What percentage of customers would we lose? | |
The BCC conduct follow-up interviews to validate the gathered information and to fill any gaps.
Analyze the information
The questionnaire is created to gather information such as the financial and non-financial impacts, recovery timeframes, and resource and application requirements. The BCC compile and analyze the responses to provide the information required to develop corporate-wide recovery and continuity strategies.
The below table captures some of the most important impact categories that we consider. This table can be used as a checklist by other IT organizations while conducting BIA.
What's at stake for Zoho?
Impact categories | Impact |
Financial impact | |
Infrastructure | |
Resource | |
Health and safety | |
Legal | |
Strategic | |
Intangible | |
The information gathered in the BIA interviews is used to:
- Identify the critical business units and processes
- Define the recovery time objective (RTO) for each business process.
- Define the recovery point objective (RPO) for each business process
- Identify resource requirements
Identifying critical functions
In the big picture, how critical is each business unit and its processes to Zoho's ability to operate? A three-point rating system helps the BCC assign a "criticality rating" to a business unit and its functions.
CATEGORIES | CRITICALITY | COLOR CODING |
1 | Critical (mission critical BUs and processes) | |
2 | Important (necessary BUs and processes) | |
3 | Minor (desirable BUs and processes) | |
Category 1
Critical business units and processes are those that:
- are most sensitive to downtime
- maintain cash flow
- fulfill service level agreements
- play a key role in maintaining Zoho's business reputation.
The BCDR focuses more time and resources on the critical BUs and functions first, followed by the important BUs and functions.
Category 2
Important business units and processes don't affect Zoho's business operations in the near term. However, if they are not functional for a longer term, they can cause some disruption to the business.
Category 3
Minor or desirable business units and processes do not cause significant disruption to the business. They are usually dealt with in the later stages of business recovery.
Recovery time objective
Once the impact data is analyzed, the BCC define the recovery time objectives (RTO). RTO is the time within which a business process should be restored following a disruption. It depends on the criticality of a business unit, process, and application, and ranges anywhere from no downtime to several days or weeks. Simply put, “How long can we be down?”
This timeframe can vary by organization — for some IT organizations, the recovery time for processes can be as low as 0 minutes.
CATEGORIES | CRITICALITY | RTO |
1 | Critical (mission critical BUs and processes) | 12 hours or less |
2 | Important (necessary BUs and processes) | 48 hours or less |
3 | Minor (desirable BUs and processes) | < 3 days |
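To make the RTO categories concrete, here is a minimal sketch that checks whether a recovery stayed within the objective for a given criticality category. The limits come from the table above; treating "< 3 days" as an upper bound of three days is a simplifying assumption, and the function name and timestamps are illustrative.

```python
from datetime import datetime, timedelta

# RTO limits per criticality category, taken from the table above.
# Category 3 is listed as "< 3 days"; it is treated here as a 3-day upper bound.
RTO_BY_CATEGORY = {
    1: timedelta(hours=12),
    2: timedelta(hours=48),
    3: timedelta(days=3),
}


def rto_met(category: int, disrupted_at: datetime, restored_at: datetime) -> bool:
    """Return True when the downtime stayed within the category's RTO."""
    downtime = restored_at - disrupted_at
    if downtime < timedelta(0):
        raise ValueError("restored_at must not be earlier than disrupted_at")
    return downtime <= RTO_BY_CATEGORY[category]


# Hypothetical outage lasting 13 hours
start = datetime(2024, 1, 1, 8, 0)
end = datetime(2024, 1, 1, 21, 0)
print(rto_met(1, start, end))  # critical process: 13 h exceeds the 12 h RTO
print(rto_met(2, start, end))  # important process: 13 h is within the 48 h RTO
```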
Recovery point objective (RPO):
RPO defines the maximum data loss that a critical business process can tolerate. Simply put, if the IT systems supporting a critical business process were to fail, how much data can we afford to lose? We use three time frames here, and these can also vary by organization.
RPO 0: no data loss (real-time backups)
RPO 1: less than 4 hours of data loss
RPO 2: 24 hours of data loss
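The RPO tiers can be read as limits on the gap between the last good backup and the moment of failure. The sketch below assumes that reading; the tier limits come from the list above (with "less than 4 hours" simplified to an upper bound of four hours), while the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Maximum tolerable data loss per tier, taken from the list above
RPO_TIERS = {
    "RPO 0": timedelta(0),
    "RPO 1": timedelta(hours=4),
    "RPO 2": timedelta(hours=24),
}


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Data written after the last backup and before the failure is lost."""
    if failure < last_backup:
        raise ValueError("failure must not be earlier than last_backup")
    return failure - last_backup


def meets_rpo(tier: str, last_backup: datetime, failure: datetime) -> bool:
    """Return True when the data loss window fits within the tier's limit."""
    return data_loss_window(last_backup, failure) <= RPO_TIERS[tier]


# Hypothetical case: nightly backup at midnight, failure at 06:00
backup_time = datetime(2024, 1, 1, 0, 0)
failure_time = datetime(2024, 1, 1, 6, 0)
print(meets_rpo("RPO 1", backup_time, failure_time))  # 6 h of loss exceeds 4 h
print(meets_rpo("RPO 2", backup_time, failure_time))  # 6 h of loss is within 24 h
```

Read this way, a process assigned RPO 1 would need backups or replication at least every four hours.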
Identifying resource requirements and dependencies
The BCC document each department and process along with the resource(s) responsible for the processes of a business unit. A list of backup resources for the process is also identified in case the lead resources are unavailable during an emergency.
The BCC also identify the systems and applications (be it a CRM, payroll, or HR software) that each business unit uses, along with the level of access its people need to get their jobs done. The level of reliance of a business unit on these systems and applications is rated as high, medium, or low in order to ensure the availability of crucial systems and applications during an emergency.
A thorough understanding of interdependencies between business units, their functions, and IT systems is crucial to both disaster recovery and business continuity. If System B depends on System A and both are down during an event, it's pointless for our IT teams to spend a week trying to restore System B while System A is still out of service. The BCC document and highlight these interdependencies at this stage to ensure the effectiveness of business continuity.
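Once interdependencies are documented, a recovery sequence that restores each system only after the systems it relies on can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the dependency map is hypothetical and simply extends the System A and System B example.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each system -> the systems it depends on
depends_on = {
    "System A": set(),
    "System B": {"System A"},
    "Directory service": {"System A"},
    "Payroll": {"System B", "Directory service"},
}


def recovery_order(dependencies):
    """Return a restore sequence in which dependencies always come first.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(depends_on)
print(order)
```

System A always appears before System B in the result. The relative order of systems that do not depend on each other is not fixed, so criticality ratings or RTOs could be used to break ties.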
BIA Report
The outcome of the BIA is documented as a BIA report with recommended recovery strategies and presented to the BCDRC for approval. This report is also incorporated into our IT disaster recovery and incident management plans as appropriate. Here is a sample BIA report for one of our BUs, IT operations.
BU Head | BU name: Network operations center (NOC) | BU head: Prabhu Ponnukumaraswamy | Email ID: xxxx@zohocorp.com | Mobile: +919999999999 |
Headcount | 50 |
Priority | Critical |
Business unit functions | |
Business unit disruption impact | |
RTO | 15 minutes |
RPO | 0 |
Internal dependencies | Human Resources, Finance, Facilities, and Security |
External dependencies | |
Recommendations | |
BCDRC Approvals
The BIA report is sent to the BCDRC for their perspective and approval, as the BIA results are used to formulate recovery strategies and continuity planning. The BIA goes through a multi-step approval process: the first level of approval comes from the BIA owner, and the final go-ahead is given by the BCDRC.
BCDR Planning
The bulk of our work in developing our BCDR plan is almost complete when we get to this point. This section is where everything comes together: the risk assessment gave us the data to identify the impact those risks can have on our business, and all of that data now helps us identify the disaster response, mitigation, and recovery strategies, as well as the people, resources, and activities that we need for effective BCDR.
The BCDR plan includes two phases:
- Emergency response procedures that all Zoho work sites follow in response to disasters like fire, flood, and earthquakes to protect employee lives and limit damage.
- Disaster recovery and business continuity activities conducted after the disruption for the restoration of business operations.
Roles and responsibilities
One of the crucial steps in emergency response and recovery is assigning roles and responsibilities. When disasters strike, the response teams on the scene are our first line of protection.
These teams help contain the impact of the disaster and effect a timely recovery before the first responders such as police or firefighters arrive at the disaster site.
Below are the response teams and responsibilities.
BCDR Roles and Responsibilities |
|
Emergency personnel |
|
Security personnel |
|
Head of facilities |
|
In house medical officers |
|
Ambulance service |
|
Employees |
|
Disaster recovery team |
|
Emergency management team |
|
IT |
|
Notification procedures
- If in-hours: On observing or being notified of a potentially serious situation (for example, a fire), the employee identifying the incident (the reporter) calls their BU head. If the BU head is unreachable or incapacitated, the reporter calls the BU head's backup, a senior manager.
- The BU head or backup notifies the emergency personnel on site (who carry out the standard emergency and evacuation procedures if necessary), as well as the EMT and DRT.
- If out of hours: IT personnel notify the EMT and DRT.
- The EMT, DRT, and other response teams respond based on the directives specified by the BCDRC.
- When a disaster is declared, the EMT and/or DRT will notify IT immediately for deployment.
- The person who is authorized to declare a disaster within the BCDRC has a backup who is also authorized to declare a disaster in the event the primary person is unavailable. For example: CEO - primary authority, COO - secondary authority.
A call tree is a general notification technique that we use to list the primary and alternate contact numbers of key personnel, as well as the numbers of backup personnel in case the key personnel are unreachable. The contact list includes the name, department, role, mobile number, residential number, and address of the key and backup personnel.
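The escalation logic behind a call tree can be sketched in a few lines: try the key person first, then each backup in order, and stop at the first one who answers. The Python example below is a hedged illustration. The `Contact` and `CallTreeEntry` types, the `first_reachable` helper, and all names and numbers are hypothetical placeholders, and the `is_reachable` callback stands in for an actual phone call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Contact:
    name: str
    role: str
    mobile: str


@dataclass(frozen=True)
class CallTreeEntry:
    key_person: Contact
    backups: Tuple[Contact, ...] = ()


def first_reachable(
    entry: CallTreeEntry, is_reachable: Callable[[Contact], bool]
) -> Optional[Contact]:
    """Try the key person, then each backup in order; None if nobody answers."""
    for contact in (entry.key_person, *entry.backups):
        if is_reachable(contact):
            return contact
    return None


# Placeholder contacts for illustration only.
bu_head = Contact("BU head (placeholder)", "BU head", "+910000000001")
senior_manager = Contact("Backup (placeholder)", "Senior manager", "+910000000002")
entry = CallTreeEntry(key_person=bu_head, backups=(senior_manager,))

# Simulate the BU head being unreachable, so the call escalates to the backup.
answered = first_reachable(entry, lambda contact: contact is not bu_head)
print(answered.name if answered else "nobody reachable")
```

Keeping the backups as an ordered tuple makes the escalation sequence explicit, which matters when more than one alternate is listed for a role.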
Disaster declaration
A disaster is declared only when the emergency is not likely to be contained and resolved within predefined time frames. The BCDRC is responsible for declaring a disaster and has to be well informed about the geographical, political, social, and environmental events that can pose a threat to Zoho's business operations. To avoid false alarms, the BCDRC has identified institutions that provide timely and meaningful disaster predictions that allow Zoho to respond and recover effectively. Below are a few of the institutions that help the BCDRC with disaster monitoring for regional work sites.
Type of disaster | Early warning/prediction systems (for regional work sites) |
Cyclones and earthquakes | India Meteorological Department and earthquake sensors |
Tsunami | India Meteorological Department and earthquake sensors |
Cyclones and earthquakes | Indian National Centre for Ocean Information Services |
Floods | Central Water Commission |
Invoking the plan
Like every IT organization, we hope to never have to invoke the BCDR. However, emergencies can arise at any time, and we believe in readiness. The BCDR is reserved for significant disasters and business disruptions and is invoked by the BCDRC.
Regardless of the service disruption circumstances, or the identity of the individual(s) in the BCDRC who are first notified of the disaster, the EMT and DRT are activated immediately in the following cases:
- The production center at Estancia is down due to a natural disaster such as a flood or an earthquake.
- Any disruption in the IT systems or network facility that can cause concurrent downtime in the production center for more than three hours.
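The two activation criteria above amount to a simple rule, which could be expressed as follows. This is an illustrative Python sketch only. The function name, its parameters, and the idea of feeding it a projected downtime estimate are assumptions made for the example.

```python
# "More than three hours" of concurrent downtime, per the criteria above.
DOWNTIME_THRESHOLD_HOURS = 3


def should_activate_teams(
    production_center_hit_by_natural_disaster: bool,
    projected_downtime_hours: float,
) -> bool:
    """Return True if the EMT and DRT should be activated immediately."""
    if production_center_hit_by_natural_disaster:
        return True
    return projected_downtime_hours > DOWNTIME_THRESHOLD_HOURS


# Example: a network disruption expected to last four hours.
print(should_activate_teams(False, 4))
```

The threshold is kept as a named constant so that a change to the criterion in the plan needs only one change in the rule.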
Internal communication
Effective internal communication is key to ensuring that employees are well-informed, supported, reassured, and, most importantly, safe during a disaster. Ideally, face-to-face communication is the most effective way to relay messages to stakeholders during a disaster. However, at Zoho, a forum post from the CEO and HR on Zoho Connect, with key messages about the disaster and the BCDR, is an effective alternative. (Zoho Connect is team collaboration software, an internal Facebook-like application that connects all stakeholders and enables collaboration during a disaster.)
In addition to the forum posts on Zoho Connect from the BCDRC and HR teams, the BU heads are the focal points for their departments, providing updates on the progress of disaster recovery and business continuity efforts and on how employees can contribute to the recovery.
Initial response
It might sound obvious but the BCDR prioritizes our employees and their lives over assets.
The emergency response procedures taken in the initial minutes of an emergency are critical to saving the lives of our employees. Our emergency procedures cover four protective actions (evacuation, shelter, shelter-in-place, and lockdown) and the relevant procedures for each. These emergency actions apply to all employees (including management personnel) and to all work sites of Zohocorp.
Authority
The instructions and guidance given by Zoho's trained emergency personnel overrule the reporting structure. This authority is given to the emergency personnel to ensure that the life and safety of employees take precedence over IT systems, other assets, and production during an emergency.
Assembly points
The BCDR plan identifies two assembly points both inside and outside Zoho premises where employees should gather after evacuating. These areas of refuge have sufficient space to accommodate all of Zoho's employees and are away from buildings, power lines, trees, gas lines, poles, and vehicles.
The two evacuation assembly points are:
- Primary - Open ground behind Zoho
- Secondary - Open ground across the street opposite to Zoho
Protective action and emergency procedures by disaster types
Disaster Type | Protective Action | Procedures |
Fire/smoke | Evacuation | |
Flood/water damage | Evacuation | |
Tornado/cyclones | Shelter | |
Earthquakes | Shelter-in-place | |
Terrorist attack | Lockdown | |
Emergency contacts
In case of emergencies, we call these emergency hotlines for help, information, and services.
Emergency Crisis Hot Lines (Regional) |
|
National emergency hotline |
|
Disaster management services |
|
AIIMMS |
|
Air ambulance |
|
Red cross |
|
Gas leak |
|
Fire department |
|
Police department |
|
Hospital |
|
Medical services (mobile) |
|
Ambulance services |
|
Utility Companies |
|
Network provider |
|
Gas |
|
Plumbing |
|
HVAC |
|
Electricity board |
BCDR activities
Here is what an emergency scenario at Zoho can look like as the recovery activities unfold. The activities below are some of the recovery activities in case of fire, flood, and earthquakes; they will, of course, vary depending on the nature of the emergency and its impact on business.
How is Zoho geared for eventualities?
Timeframe |
Activities |
First 4 hours |
External communication: Our communications team collects information from reliable sources, crafts key messages (before, during, and after the disaster), and ensures a consistent message across all channels: website, blog, media, news releases, social media, and so on. The team holds a ready list of potential external audiences (emergency medical services, fire department, police, local government, suppliers, and vendors) along with their contact numbers. Two official spokespersons, the President/Vice President of Zoho and ManageEngine, with solid experience in working with both print and broadcast media, are the primary contacts for all media inquiries. The spokespersons typically run all press conferences and give most analyst and partner interviews during a crisis. All external communication includes details of the disaster: the date and time of occurrence, a description of the impact of the disaster on business, the steps being taken for risk mitigation, recovery, and business continuity, and the estimated time for recovery.
Emergency command centers (ECC): Our emergency command centers are the coordination hubs for disaster response. From these centers, the BCDRC and response team personnel gather critical information, coordinate response and recovery activities, and manage employees as the emergency situation demands.
Emergency command center 1: Estancia IT Park, Chennai, India
Emergency command center 2: Tenkasi, India
Alternate locations: In case of temporary or permanent loss of a disaster-struck facility, the 12 offices spread across different countries act as alternate locations for each other. We move our critical business functions to alternate sites that are equipped to provide working environments similar to those of other sites. Alternate sites may include (but are not limited to):
Critical teams and resources: In the BIA phase of this plan, we identified the critical teams and employees that are considered essential during an emergency or disaster. Critical BUs such as customer-facing teams (presales, sales, customer support) and their resources are moved to alternate locations. Minimal resources from other critical BUs such as HR and facilities report to work regardless of conditions.
Availability: Application data is stored on resilient storage that is replicated across data centers. Data in the primary DC is replicated to the secondary DC in near real time. If the primary DC fails, the secondary DC takes over and operations continue with minimal or no loss of time. Both centers are equipped with multiple ISPs. We have power backup, temperature control systems, and fire-prevention systems as physical measures to ensure business continuity. These measures help us achieve resilience. The live status and historical status data (30 days) of cloud services can be seen at status.zoho.com / status.zoho.eu / status.zoho.in / status.zoho.com.au.
Disaster-ready data backups: Data backup and recovery is critical for restoring data after natural disasters. At Zoho, we perform full and incremental backups to preserve corporate information. These backups are performed on a regular basis for audit logs and files that are considered critical. The backup media is stored in a secure offsite data center, geographically separate from the original. |
5-24 hours |
Succession plan: In case of casualties, we activate the succession plan, which lists who replaces members of the BCDRC, senior managers, managers, and team leads if they are not available to carry out their responsibilities during an emergency.
Stabilize the situation: The disaster situation is stabilized to save lives, usually at the response stage. However, some stabilization activities, such as removing records from the disaster location and isolating affected systems, are done before damage assessment to prevent further damage to records, information, and assets.
Damage assessment: Once a disaster is declared, the DRT is mobilized. The DRT (under the direction of the location authorities) assesses the damage as quickly as conditions permit. The assessment covers damage to:
Damage assessment helps us gauge the extent of damage: what can be replaced, salvaged, or reconstructed. The results are documented in the damage assessment and evaluation form (see the Forms section below for a complete list of the forms that we use during emergencies). This helps us develop a restoration priority list and identify the facilities, vital records, and equipment needed to resume activities. The EMT and DRT gather all the information regarding the event and send it to the BCDRC for review. The decision to move to the business continuity phase is made at this point. If the situation does not warrant this action, the EMT and DRT continue to address the situation at the affected site(s).
Supply chain: In times of disaster, supply chains that were functioning well can experience significant disruption. We've identified a list of key backup vendors for all essential equipment and supplies so that we can switch to them if the primary vendor is also affected by the disaster. |
Days 2-4 |
Salvage operations at disaster site: The salvage operations now begin for damaged IT systems, furniture, workstations, and records with appropriate procedures. The activities include:
Move critical resources back to primary site: As soon as the primary site is stabilized and repaired, the critical resources are moved back into the primary site. |
Days 5-14 |
Bringing back business as usual (BAU): In the event of total facility destruction, efforts begin to fully rebuild the facility while critical employees continue to work from alternate locations and other employees work from home. In case of partial damage, the facility is rebuilt in the shortest time possible and all employees are moved back into the primary facility. Once all the IT systems, records, data, and supplies are restored and normalcy returns to the organization, external communication is sent out to customers, partners, press, and concerned authorities. |
Forms
Disaster form
In the event of a disaster, the on-duty personnel make the initial entries in a disaster form. This form captures a chronological log of the business impact reported during the event. It is then forwarded to the ECC, where it is continually updated. The running log remains active until the disaster ends and business returns to normal.
Date and time |
Type of event |
Location |
Building access issues |
Projected impact to operations |
Running log (ongoing events) |
Critical equipment status assessment and evaluation form
Date and time |
Type of event |
Location |
Equipment |
Condition |
Salvage |
Comments |
Critical equipment status codes:
OK - Undamaged
DBU - Damaged, but usable
DS - Damaged, requires salvage before use
D - Destroyed, requires reconstruction
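To show how the status codes above can feed a restoration priority list, the following Python sketch groups assessed equipment by what has to happen next. It is an illustration under stated assumptions. The `Condition` enum mirrors the four codes, while the `triage` helper, the bucket names, and the equipment names are hypothetical.

```python
from enum import Enum


class Condition(Enum):
    OK = "Undamaged"
    DBU = "Damaged, but usable"
    DS = "Damaged, requires salvage before use"
    D = "Destroyed, requires reconstruction"


def triage(assessments: dict[str, Condition]) -> dict[str, list[str]]:
    """Group equipment names by the next action implied by their condition."""
    buckets: dict[str, list[str]] = {
        "usable_now": [],
        "salvage": [],
        "reconstruct": [],
    }
    # Sorting by equipment name keeps the output stable between runs.
    for equipment, condition in sorted(assessments.items()):
        if condition in (Condition.OK, Condition.DBU):
            buckets["usable_now"].append(equipment)
        elif condition is Condition.DS:
            buckets["salvage"].append(equipment)
        else:
            buckets["reconstruct"].append(equipment)
    return buckets


# Hypothetical assessment entries for illustration only.
sample = {
    "core switch": Condition.DBU,
    "file server": Condition.DS,
    "ups unit": Condition.D,
    "workstation 12": Condition.OK,
}
print(triage(sample))
```

Using an enum rather than free-text codes means a mistyped code fails when the form is recorded, not later during salvage planning.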
BCDR Approvals
Once the BCDR plan is completed along with the estimated costs for recovery, it is sent to the BCDRC for formal approval. The BCC secure the support and buy-in of senior management, which underlines senior management's commitment to the BCDR process and its importance.
Implementation and Training
We've now created a BCDR plan, and it's part of our mainstream processes and policies. The last step in being BCDR-ready is regular training: training all those who use the BCDR plan, as well as employees who weren't part of its development, is critical to the success of the plan. The training can take the form of walk-throughs, mock disaster drills, or component testing.
The DRT and EMT teams choose disaster scenarios that can realistically happen. For example, they can build a scenario around a fire accident to conduct mock fire drills. The fire drills are conducted every six months to check the reaction of employees, the efficiency of the fire alarm and firefighting systems, the execution of evacuation procedures by the emergency personnel, and the disaster response and recovery activities.
We also train our IT teams in disaster recovery activities to get them up to speed as they are instrumental in keeping our systems available and accessible in an emergency.
Plan Review and Maintenance
BCDR plan review and maintenance are closely tied: maintaining the plan requires a review from time to time to ensure that it stays current and that any changes to infrastructure or personnel details are reflected in it. As part of our continual improvement efforts, the BCC gather lessons learned from disaster experiences and mock drills, and update the plan with new information gleaned from these experiences.
Updates, revisions, and approvals of changes to the BCDR plan are handled using the change management module in our ITSM tool.