ManageEngine's ITOM framework

Setting up a monitoring apparatus
Managing everyday operations
Maintaining a stable ITOM environment

Above are the three main components of our ITOM framework. We have two designated teams to manage this framework:

1) Network operations center (NOC) team:

This team monitors the LAN, WAN, and network devices from a central location: our corporate office in Chennai. Its scope of work includes analyzing problems with network devices (switches, routers, and firewalls), troubleshooting issues, reporting incidents, communicating with site technicians, and tracking problems through resolution.

2) Zorro team:

This team manages servers and load balancers in our DCs. Its scope of work involves maintaining, managing, and continually improving the performance of Zoho's DCs.

Responsibilities of the network operations center

1. Setting up a monitoring apparatus

Monitoring networks

Our monitoring apparatus gives us a comprehensive view of our network. ManageEngine's network monitoring has two major functions:

Monitoring corporate offices

Monitoring DCs

We have a 24/7 command center equipped with screens giving us visibility all across our network in both corporate offices and DCs.

We have DCs in India, the US, Europe, China, Australia, and Singapore. In most of these countries, we have multiple sites that hold our customers' data. We are also expanding our DCs to more locations because our user base increases every day.

We also have a corporate network, with our main corporate office in Chennai and backup offices in Tenkasi and Renigunta.

To monitor such a widespread network, we need three things:

Visibility across the entire network

Alerts to notify us of abnormalities

A team of technicians who offer 24/7 support by working in rotation

Workflow of network monitoring

The NOC team uses in-house tools and a few external applications to monitor network performance and the uptime of network devices in all DCs. We use monitoring tools, configuration management tools, traffic and log analyzers, and more. These tools constantly ping the network devices for status updates, and if an error is detected, they generate an alert. The alert details are sent via email, which is then logged as a ticket automatically in our IT service desk.

Then the NOC engineer contacts the remote support engineers at the respective corporate office or DC via phone to resolve the issue. Upon resolution, the original alert is cleared in the monitoring tool and the ticket is marked as closed by the NOC engineer.
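As a rough illustration of this workflow, here is a minimal Python sketch. It is not our production tooling: the `Monitor` and `Ticket` classes, the `probe` callable (a stand-in for a ping or status check), and the device names are all hypothetical, and the email step is reduced to creating a ticket directly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    device: str
    status: str = "open"


@dataclass
class Monitor:
    # probe(device) returns True when the device responds (hypothetical check).
    probe: Callable[[str], bool]
    active_alerts: Dict[str, Ticket] = field(default_factory=dict)

    def poll(self, devices: List[str]) -> None:
        """Check every device and raise one alert (and ticket) per failing device."""
        for device in devices:
            if not self.probe(device) and device not in self.active_alerts:
                self.active_alerts[device] = Ticket(device)

    def resolve(self, device: str) -> Ticket:
        """Called by the NOC engineer after the fix: clear the alert, close the ticket."""
        ticket = self.active_alerts.pop(device)
        ticket.status = "closed"
        return ticket
```

Polling the same failing device twice does not create a second ticket, which mirrors the idea that one alert maps to one ticket until an engineer closes it.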

The NOC team monitors the following devices:

Switches:

The NOC team monitors the uptime, availability, and resource usage of switches using our monitoring tools. It also uses log analyzers to receive and analyze the alerts from syslog messages. We also analyze the interface traffic in switch ports using our in-house traffic monitoring tools.

Firewalls:

Similar to switches, the NOC team monitors the uptime, availability, resource usage (CPU, memory, and disk), and redundancies in firewalls using our monitoring tools.

Routers:

In our integrated DCs (IDCs), we monitor memory utilization and interface usage. Our ISPs handle routers in our corporate offices.

Monitoring DC performance

We use our in-house monitoring tools to get insights into our performance and dashboards to understand the trends of monitored parameters. We use the load balancer dashboard, event dashboard, service map dashboard, and more to understand how various components in our DCs are performing. Above all, we use a monitoring apparatus that gives us an eagle eye on all our DCs.

If a certain component or process within a component is abnormal, the dashboard will display it as an event while sending an email alert to our Zorro team and the other teams concerned.

The dashboard we use the most is the network dashboard, which is unique for each DC. We use the following insights from the network dashboard to understand the performance of our DCs and to take necessary action:

  • Availability of servers
  • Available disk space
  • File system errors
  • Hardware failures
  • RAID status
  • Application ports down (like MySQL and web servers)
  • DNS lookup failures
  • DNS time lag
  • Power usage of various clusters
  • Backup status
  • Server uptime

If any one of these parameters is not functioning according to our set of standards, we receive alerts on our visual monitoring dashboards. These alerts remain on the screens until an engineer from the Zorro team resolves the errors.
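To make the idea of checking parameters against a set of standards concrete, here is a small Python sketch. The parameter names and threshold values are assumptions chosen for illustration; they are not our actual standards.

```python
from typing import Callable, Dict, List

# Hypothetical standards: each check returns True when the reading is healthy.
STANDARDS: Dict[str, Callable[[float], bool]] = {
    "free_disk_percent": lambda v: v >= 15,    # assumed threshold
    "dns_lookup_ms": lambda v: v <= 200,       # assumed threshold
    "raid_disks_missing": lambda v: v == 0,    # every RAID disk must be present
}


def find_violations(readings: Dict[str, float]) -> List[str]:
    """Return the monitored parameters whose reading fails its standard."""
    return [
        name
        for name, is_healthy in STANDARDS.items()
        if name in readings and not is_healthy(readings[name])
    ]
```

A dashboard built on a check like this would keep showing each returned parameter until a later reading passes again.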

In addition to the other monitoring tools we use for DCs, we use Site24x7, our in-house monitoring service, to monitor the availability of Zoho's services from different geographic locations across the globe. As with our DCs, we set up automated email alerts that notify the respective application teams and the Zorro team when a particular service is not available in a monitored location.

In conclusion, both the NOC and Zorro teams monitor various aspects of our IT operations. Our monitoring apparatus, which gives us visibility across our network, helps the teams act as needed.

2. Managing everyday operations

Overall management

ManageEngine uses a primary management tool known as Zoho Admin Console (ZAC) to manage major IT operations:

  • Inventory
  • DC infrastructure
  • Alerts and notifications
  • Server allocation and provisioning
  • Load balancer configurations
  • Domains and certificates

ManageEngine uses a wide range of tools to manage various ITOM requirements.

Tool categories and the ITOM requirements they cover:

Tools for monitoring traffic

  • Monitor interface traffic.
  • Monitor firewall logs and network traffic.
  • Generate various reports on slow performance, network security, and more.
  • Assist with log analysis, firewall policy management, and change management.
  • Monitor switches, firewalls, interfaces, and logs.
  • Provide insights using the network dashboard, developer dashboard, service map dashboard, event dashboard, load balancer dashboard, and more.

Tools for managing device configurations

  • Make automated backups of device configurations.
  • Automate routine tasks, like firmware updates, SNMP configuration, and VLAN configuration.
  • Facilitate applying configuration changes to devices.

Tools for analyzing logs

  • Organize and analyze event logs and syslog messages.

Tools for other IT operations

  • Create custom forms to manage network changes, access controls, asset purchases, and more.
  • Maintain all NOC-related documents, policies, procedures, vendor contacts, and more.
  • Manage passwords for network devices, vendor portals, and more.
  • Track, manage, and resolve tickets for the NOC and Zorro teams.

Troubleshooting

In the previous section, we saw how the NOC and Zorro teams use monitoring tools to understand how our network is performing. Monitoring certain parameters helps us spot when there is an issue. However, to manage those issues on a regular basis, we need troubleshooting guidelines.

These guidelines provide quick action plans for the engineers when they see an issue with the monitored parameters. Here is a sample of such issues and the required troubleshooting actions:

| Category | Objective | Troubleshooting action |
| --- | --- | --- |
| Availability of servers | To ensure servers are in a condition to execute the assigned workload | Use the Intelligent Platform Management Interface to contact the hardware end of the server and fix the issue. |
| Available disk space | To ensure the disk has sufficient storage space | Replace the disk when an alert is generated for low disk space. |
| File system errors | To ensure storage disks are not corrupted | Replace the disk when an alert is generated for file system errors. |
| Hardware failures | To ensure there is no failure of hardware components that might cause a drop in performance | Repair or replace the faulty hardware depending on the warranty status. |
| RAID status | To ensure all the disks in the RAID are available | Replace the disk if a disk in the RAID setup is unavailable. |
| DNS lookup failures | To ensure the domain is reachable | Perform a DNS lookup and check the DNS service for errors. |
| Application ports down | To ensure applications are available | Notify application teams to restart the application ports. |
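Guidelines like these lend themselves to a simple lookup. The Python sketch below encodes the sample above as a dictionary; the `RUNBOOK` name and the `troubleshooting_action` helper are hypothetical, and returning `None` for an unknown category is an assumption about how such a helper might behave.

```python
from typing import Optional

# Category -> troubleshooting action, taken from the sample guidelines above.
RUNBOOK = {
    "availability of servers": "Use the Intelligent Platform Management Interface "
                               "to contact the hardware end of the server and fix the issue.",
    "available disk space": "Replace the disk when an alert is generated for low disk space.",
    "file system errors": "Replace the disk when an alert is generated for file system errors.",
    "hardware failures": "Repair or replace the faulty hardware depending on the warranty status.",
    "raid status": "Replace the disk if a disk in the RAID setup is unavailable.",
    "dns lookup failures": "Perform a DNS lookup and check the DNS service for errors.",
    "application ports down": "Notify application teams to restart the application ports.",
}


def troubleshooting_action(category: str) -> Optional[str]:
    """Look up the action for an alert category, ignoring case and outer spaces."""
    return RUNBOOK.get(category.strip().lower())
```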

Access controls

Access control mechanisms are a crucial part of managing everyday operations. We deal with the following aspects in access control:

  • Physical access
  • Logical access
  • Access to sensitive areas
  • Access for non-IT teams

| Access type | Scope | Provided by | Approved by | Governed by |
| --- | --- | --- | --- | --- |
| Physical access | Access to NOC rooms, server rooms, asset storage rooms, and IDCs | The HR team raises a request through a dedicated access control application for employees who wish to access certain spaces. The Building Management System and the HR team provide the required access. | A manager from the Zorro team reviews the purpose of the request and approves it through the access control application. | The physical security policy and access control policy |
| Logical access | Access to the IT operations admin console, IT ticketing system, monitoring tools, password management tools, and other related tools | A senior member of the team raises a request through the access control application via a separate access creation form. The request is then forwarded to the concerned team that must provide the access. | A manager from the respective team approves the request through the access control application. | The access control policy |
| Access to sensitive areas | Access to IDCs, physical access to DCs, and access to DCs from remote locations | The employee raises a request through an app dedicated for such access. The Zorro team processes it by creating an account for the user in that app after a senior member approves the request. | The request is first approved by the manager of the employee who raised it, then by the Zorro team after reviewing it. | The information security policy, access control policy, and physical security policy |

Well-defined policies, processes, and applications help us manage access controls in a smooth, secure manner.
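The two-step approval for sensitive areas (the employee's manager first, then the Zorro team) can be modeled as an ordered approval chain. The following Python sketch is illustrative only: the `AccessRequest` class and the role names are hypothetical and do not describe the actual access control application.

```python
from dataclasses import dataclass, field
from typing import List

# Approval order for sensitive-area access: manager first, then the Zorro team.
APPROVAL_CHAIN = ["employee_manager", "zorro_team"]


@dataclass
class AccessRequest:
    employee: str
    area: str
    approvals: List[str] = field(default_factory=list)

    @property
    def is_granted(self) -> bool:
        return self.approvals == APPROVAL_CHAIN

    def approve(self, approver_role: str) -> None:
        """Record an approval, rejecting any approver who acts out of turn."""
        expected = None if self.is_granted else APPROVAL_CHAIN[len(self.approvals)]
        if approver_role != expected:
            raise ValueError(f"expected approval from {expected}, got {approver_role}")
        self.approvals.append(approver_role)
```

Enforcing the order in code means the Zorro team cannot approve a request that the employee's manager has not yet reviewed.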

OS management

The Zorro team configures and manages OSs on all our servers. OS management is critical to ensure high availability and security. The Zorro team performs OS hardening to comply with the required security standards. It also performs patch management to ensure protection from vulnerabilities.

The Zorro team carries out OS hardening to remove unwanted packages and install the OS with minimal requirements. The process of hardening has three phases: preparation, installation, and hardening.

Preparation:

  • First, we select the latest stable version of the OS to ensure the applications perform optimally.
  • Second, we select the required minimum of packages and file system configurations according to the requirements of each application.
  • Last, we test the OS, the packages, and the configurations in a test environment to ensure each application is performing optimally.

Installation:

  • First, we create user accounts and assign them to NOC and Zorro team members for monitoring the installed OS.
  • Second, we install customized packages needed for the applications and uninstall all other packages.
  • Last, we enable security features like logging, antivirus, and host-based intrusion detection systems.

Hardening:

  • First, we remove unwanted services and packages, then enable only those required.
  • Second, we ensure high network availability by employing multiple network interface cards and switches. We also ensure network security by disabling features like IP forwarding and ICMP redirects.
  • Last, we sort out user access and permissions for monitoring the OS.

We test the hardened OS in a local environment first. Once the testing engineers confirm that the OS is free from performance and security issues, we deploy the servers in the production environment.
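One hardening step mentioned above, disabling IP forwarding and ICMP redirects, is easy to audit automatically. The Python sketch below assumes Linux servers (this document does not name the OS) and uses the standard Linux kernel parameters `net.ipv4.ip_forward`, `net.ipv4.conf.all.accept_redirects`, and `net.ipv4.conf.all.send_redirects`. The `EXPECTED` table and the `audit` helper are hypothetical names for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Hardened values for the settings named above (Linux sysctl keys, assumed OS).
EXPECTED = {
    "net.ipv4.ip_forward": "0",
    "net.ipv4.conf.all.accept_redirects": "0",
    "net.ipv4.conf.all.send_redirects": "0",
}


def read_sysctl(key: str) -> str:
    """Read one kernel parameter from /proc/sys (works on Linux only)."""
    return Path("/proc/sys", *key.split(".")).read_text().strip()


def audit(current: Dict[str, str]) -> List[str]:
    """Return the keys whose current value differs from the hardened value."""
    return [key for key, wanted in EXPECTED.items() if current.get(key) != wanted]
```

On a Linux host, `audit({key: read_sysctl(key) for key in EXPECTED})` would list any setting that still needs attention; an empty list means all three are disabled.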

Risk management

The risks associated with our IT operations are mainly divided into three categories:

  • Hardware failure risks
  • Downtime risks
  • Data security risks

These three categories involve our IT teams to a major extent, while other risks are handled by our Incident Management and SPA teams. The functioning of these teams is covered in detail in our books: Incident management handbook and A CIO's guide to rethinking compliance.

For the risks involving our IT teams, we use controls (measures against risks) so that we are in the best position to handle such risks.

Risk categories and their controls:

Hardware failure risks

  • Maintain enough spares in our DCs for components that are prone to failure.
  • Maintain a warranty of at least three years for servers.
  • Automate disk failure detection using monitoring tools.
  • Degauss failed hard drives using a standard, well-tested procedure.

Downtime risks

  • Configure automatic alerts in Site24x7 to notify the NOC and Zorro teams.
  • Configure alerts in network monitoring tools to alert the NOC team of network device downtime.
  • Update procedures to address downtime through configuration changes or replacing faulty components.

Data security risks

  • Monitor DDoS solutions for possible attacks.
  • Monitor the traffic through network devices using a switch port management solution.
  • Establish procedures to work with the incident management team.

3. Maintaining a stable ITOM environment

ITOM support

We offer 24/7 ITOM support by having technicians work in shifts. We have segmented issues and support tickets based on difficulty. We also have a set of guidelines on how to assign tickets to technicians based on their expertise and the difficulty of the issues at hand.

To support our ITOM engineers, our in-house knowledge management tool hosts a support library. The library consists of all ITOM-related documents, like:

  • Solutions to various ITOM issues.
  • Procedures for ITOM activities.
  • The history of previous ITOM use cases and how they were solved.
  • ITOM policies.
  • ITOM process narratives.

Capacity planning

We use predictive analysis to procure servers in advance to maintain high availability. We also make future projections based on a schedule to avoid the risk of running out of servers. When the number of available servers drops to a threshold limit, the Zorro team purchases additional ones. It also maintains a percentage of servers in the free pool for the emergency requirements of any team.
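As a simplified picture of threshold-based procurement, the Python sketch below projects how long the free pool will last. It uses plain linear extrapolation from past free-pool sizes, which is a stand-in for the predictive analysis mentioned above, not a description of it. The function names and the sample numbers are assumptions.

```python
from typing import List


def average_usage(free_history: List[int]) -> float:
    """Average number of servers consumed per period, from past free-pool sizes."""
    if len(free_history) < 2:
        raise ValueError("need at least two observations")
    drops = [earlier - later for earlier, later in zip(free_history, free_history[1:])]
    return sum(drops) / len(drops)


def periods_until_threshold(free_now: int, usage: float, threshold: int) -> float:
    """Periods left before the free pool falls to the purchase threshold."""
    if free_now <= threshold:
        return 0.0            # threshold already reached: purchase now
    if usage <= 0:
        return float("inf")   # the pool is not shrinking
    return (free_now - threshold) / usage
```

For example, with free-pool sizes of 100, 92, 85, and 76 over four weeks and a threshold of 20 servers, the average usage is 8 servers a week, so the threshold would be reached in 7 weeks.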

For components in the DCs, we gather and update information on cages, power, and other requirements using ZAC, our primary management tool. For network devices like switches and ports, we collect insights using our switch port management software to ensure device procurement matches the server procurement schedule.

Backup and maintenance

Both the Zorro and NOC teams make incremental backups on weekdays and full backups on Sundays.
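The schedule above maps directly onto the day of the week. Here is a minimal Python sketch; the `backup_type` function is hypothetical, and because the schedule as stated does not mention Saturdays, the sketch returns `None` for that day rather than guessing.

```python
from datetime import date
from typing import Optional


def backup_type(day: date) -> Optional[str]:
    """Return the backup to run on the given day under the schedule above."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday == 6:
        return "full"
    if weekday <= 4:
        return "incremental"
    return None  # Saturday: not covered by the stated schedule
```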

The Zorro team uses ZAC to make automated backups of databases on our servers. These backups are stored on a central backup server with 256-bit AES encryption. This central backup server has the same redundancy mechanism as our main servers. We store these backups on the server for a specified period based on our backup policy. ZAC scans the backup process periodically to check if it is successful. If unsuccessful, the tool raises a ticket to be resolved by the Zorro team. It also runs various checks for the integrity of the backups based on a schedule.
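One common way to check the integrity of a backup is to record a checksum when the backup is written and compare it later. The Python sketch below shows that idea with SHA-256 from the standard library. It is an assumption about how such a check could work, not a description of what ZAC does, and the function names are hypothetical.

```python
import hashlib
import hmac


def checksum(data: bytes) -> str:
    """Compute a SHA-256 hex digest to store alongside the backup."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(data: bytes, recorded_checksum: str) -> bool:
    """Return True if the backup still matches the checksum recorded at write time."""
    return hmac.compare_digest(checksum(data), recorded_checksum)
```

A failed comparison would be the point at which a tool raises a ticket for the team to investigate.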

To retrieve server backups, we use a separate Zoho Creator application that maintains backup requests.

The NOC team uses our network configuration management tool to back up network device configurations on a schedule similar to the one the Zorro team uses for incremental backups. The schedules and procedures for both teams are governed by our company-wide backup policy. In the event of a device failure, the backed-up configuration is restored. Like the Zorro team, the NOC team receives a ticket when a backup fails.

To learn more about how we manage major changes, like changes to device configurations, check out our change management e-book. For major device configuration changes, we declare planned maintenance and execute the changes through our change management framework.

Disaster recovery

We follow a blueprint for disaster recovery, as discussed in our business continuity e-book. The Zorro team approaches disaster recovery in two ways:

  • Internal redundancies
  • External redundancies

Internal redundancies are used for disaster recovery when components within a DC fail. For example, during server distribution, components are distributed across racks in such a way that the failure of one or more racks will not impact the overall functioning of the DC. Likewise, ZAC allows us to perform impact analysis and restore the service to its previous best state when there is a component failure across racks or clusters.

External redundancies are used when we have to restore the entire DC. The disaster recovery DC (DR DC) is a geographically separated replica of the main DC. When the main DC encounters failure, the DR DC takes over. Until such a situation arises, the DR DC functions in read-only mode. After the DR DC takes over, the write permission is enabled.
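The read-only behavior of the DR DC can be illustrated with a short Python sketch. The `DataCenter` and `DisasterRecoveryPair` classes are hypothetical, and replication is reduced to copying a dictionary; the point is only to show write permission moving to the DR DC on failover.

```python
from typing import Dict


class DataCenter:
    def __init__(self, name: str, writable: bool) -> None:
        self.name = name
        self.writable = writable
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        if not self.writable:
            raise PermissionError(f"{self.name} is in read-only mode")
        self.data[key] = value


class DisasterRecoveryPair:
    def __init__(self, main: DataCenter, dr: DataCenter) -> None:
        self.main = main
        self.dr = dr
        self.active = main

    def replicate(self) -> None:
        """Keep the DR replica in sync with the main DC (simplified to a copy)."""
        self.dr.data = dict(self.main.data)

    def fail_over(self) -> None:
        """The main DC has failed: the DR DC takes over and accepts writes."""
        self.main.writable = False
        self.dr.writable = True
        self.active = self.dr
```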

Figure: ManageEngine disaster recovery plan

We test our business continuity and disaster recovery models every year.

Process improvement

We monitor the performance of our ITOM teams using two methods.

1) Ticket-based:

Tickets are divided into tiers based on their difficulty level. Each tier is given a standard resolution time. We track SLAs to find out which tiers take longest to resolve, then improve the process to reduce the resolution time.

For example, here is how we measure performance during a particular period:

| Tier | Ticket type | Standard resolution time | Average time taken during the last quarter |
| --- | --- | --- | --- |
| Tier 1 | Resolvable via direct calls to engineers during their shifts | 1-2 hours | 1.5 hours |
| Tier 3 | Server provisioning, domain requests, new package installation requests, and more | Up to 3 days | 2.8 days |
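To see which tier is closest to its limit, we can compare each average against the upper bound of its standard resolution time. The Python sketch below uses the two sample tiers above; the `TIERS` structure and function names are hypothetical, and treating the upper bound (2 hours, 3 days) as the limit is an assumption.

```python
from typing import Dict, List, Tuple

# (upper limit, last-quarter average), in the same unit within each tier.
TIERS: Dict[str, Tuple[float, float]] = {
    "Tier 1": (2.0, 1.5),  # hours
    "Tier 3": (3.0, 2.8),  # days
}


def sla_usage(limit: float, average: float) -> float:
    """Fraction of the allowed resolution time that the average ticket uses."""
    return average / limit


def tiers_by_pressure(tiers: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Rank tiers from closest to the limit to furthest from it."""
    ranked = [(name, sla_usage(limit, avg)) for name, (limit, avg) in tiers.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

With these sample figures, Tier 3 uses about 93% of its allowed time and Tier 1 uses 75%, so Tier 3 would be the first candidate for process improvement.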

2) Solution-based:

We approach issues based on the type of solution provided, with a standard resolution time and various applicable solutions for each category. We always work towards reducing the resolution time by implementing each solution faster.

Here is an example of that approach:

| Ticket type | Solution provided | Average time taken during the last quarter |
| --- | --- | --- |
| ISP-related | Switch ISPs | < 5 minutes |
| LAN outage | Switch to the DR DC | < 30 minutes |
