Case studies:
The ITOM framework in action
Note: ManageEngine uses a wide range of tools for its ITOM framework. To simplify the concepts, we left out the specific names of the tools in this section. However, if you would like to know the names of the tools we use, refer to the "Glossary of tools" section at the very end of this e-book.
Case study 1:
Navigating network traffic
Scenario: A set of users is consuming too much bandwidth. They are downloading too much data within a short period.
1. Setting up a monitoring apparatus
Our bandwidth monitoring tool scans each port to understand the excess consumption.
2. Managing everyday operations
Alerts are configured to let shift technicians know when the consumption crosses threshold values.
3. Maintaining a stable ITOM environment
The technicians receive guidelines from our experienced IT leaders on handling specific requirements and exceptions.
First, we use our bandwidth monitoring tool to measure bandwidth consumption. The tool scans each port and retrieves details such as the users connected to each port, the sites they visited, and how much bandwidth they consumed. We gather this information at both the user level and the switch level, then analyze consumption by user or device to pinpoint the source of the excess.
Next, once we have enough information about this anomaly, we use configured alerts to ensure that the technicians working during that shift are aware that the bandwidth consumption has crossed threshold values. ZAC has features that allow us to send customized notifications based on configured alerts.
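As a rough illustration of the per-user and per-switch analysis and the threshold check described above, here is a minimal Python sketch. The record layout, the user and switch names, and the threshold value are all assumptions made for the example; they are not taken from any specific tool.

```python
from collections import defaultdict

# Hypothetical usage records: (switch, port, user, bytes transferred in the window)
records = [
    ("sw-1", 1, "alice", 4_000_000_000),
    ("sw-1", 2, "bob", 300_000_000),
    ("sw-2", 1, "alice", 2_500_000_000),
    ("sw-2", 5, "carol", 900_000_000),
]

THRESHOLD_BYTES = 5_000_000_000  # assumed per-user limit for the window


def total_by(records, key_index):
    """Sum transferred bytes grouped by one column of the record tuple."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[3]
    return dict(totals)


def users_over_threshold(records, threshold):
    """Return the users whose total consumption exceeds the threshold, sorted by name."""
    per_user = total_by(records, 2)
    return sorted(user for user, used in per_user.items() if used > threshold)


per_switch = total_by(records, 0)
offenders = users_over_threshold(records, THRESHOLD_BYTES)
for user in offenders:
    print(f"ALERT: {user} exceeded {THRESHOLD_BYTES} bytes in this window")
```

In practice the alert line would be replaced by whatever notification channel the shift technicians use.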
Finally, our ITOM support technicians decide how to fix this anomaly. They may enforce firewall restrictions on a specific site, apply traffic shaping for a specific user, or allow the consumption to continue by diverting the traffic to another ISP that has greater bandwidth. For instance, if the users are testing a set of tools or working on an important project release on a tight deadline, they might need that extra bandwidth.
Case study 2:
Approaching downtime
Scenario: Our primary DC in the US is experiencing downtime.
1. Setting up a monitoring apparatus
Our monitoring tool sends out an alert on our monitoring screens.
2. Managing everyday operations
The troubleshooting section provides the engineers with checklists and guidelines for tracing the root cause of the problem.
3. Maintaining a stable ITOM environment
Through the process improvement section, we track how long it took to implement the fix and how we can improve on it. We also document the fix so that any NOC engineer in the future can refer to it.
First, ZAC sends out a red alert on our NOC monitoring screens. The tool also provides insights into the statuses of other components in the DC, like routers, switches, firewalls, load balancers, WLAN controllers, servers, and VMs. This helps the engineers get a good idea of where to start.
Next, the troubleshooting section in our framework provides detailed checklists on how to handle downtime. For instance, we check the following in order to see where the problem is: switch, firewall, server, router, ISP, and upstream provider paths.
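The ordered checklist can be thought of as a simple walk through the components until one is found unhealthy. The following Python sketch assumes the health of each component has already been collected into a dictionary; the function name and the input format are hypothetical.

```python
# Check order taken from the checklist above.
CHECK_ORDER = ["switch", "firewall", "server", "router", "ISP", "upstream provider"]


def first_failing_component(status):
    """Walk the checklist in order and return the first unhealthy component.

    `status` maps a component name to True (healthy) or False (unhealthy).
    Components missing from `status` are treated as healthy.
    Returns None when every component is healthy.
    """
    for component in CHECK_ORDER:
        if not status.get(component, True):
            return component
    return None


# Hypothetical snapshot: the router and everything behind it look down.
snapshot = {"switch": True, "firewall": True, "server": True, "router": False, "ISP": False}
start_here = first_failing_component(snapshot)
```

Checking in a fixed order keeps the investigation repeatable from one engineer to the next.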
ZAC also automatically runs a traceroute, so the engineer can use the packet's path to decide where to start. Then they perform the following actions: ping, duplicate, and check the destination firewall. It becomes a process of testing out various possibilities. For example:
- If the destination firewall is reachable, what do you do, and where do you check?
- If the destination firewall is not reachable, the issue could be anywhere in the following areas:
  - The network
  - The hardware side
  - A software bug
  - The tunnel through which the traffic passes
  - The ISP
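To narrow down where along the path the traffic stops, the traceroute output mentioned earlier can be reduced to the last hop that answered. This is only a sketch: it assumes the hops have already been parsed into a list, with None standing for a hop that did not reply, and the addresses are made-up documentation addresses.

```python
def last_responding_hop(hops):
    """Return (hop_number, address) for the last hop that replied, or None.

    `hops` is a list of addresses in path order; None marks a hop with no reply.
    Intermediate silent hops are tolerated, since some devices never answer probes.
    """
    last = None
    for number, address in enumerate(hops, start=1):
        if address is not None:
            last = (number, address)
    return last


# Hypothetical parsed traceroute: the path goes silent after hop 4.
hops = ["10.0.0.1", "10.0.1.1", None, "192.0.2.10", None, None]
boundary = last_responding_hop(hops)
```

The engineer would then start checking at the device just beyond the last responding hop.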
Finally, the process improvement section helps us document this effort, noting the time taken to fix the issue and how it was fixed. We analyze the documentation to see what we could have done better. Documenting ensures that the next time we face downtime due to a similar issue, we are able to fix it faster.
Case study 3:
Managing network devices
Scenario: We need to create custom VLAN configurations (VLAN A and VLAN B) in a group of switches.
1. Setting up a monitoring apparatus
ZAC maintains an inventory of all available switches.
2. Managing everyday operations
Our configuration management tool helps create scripts and execute them in multiple groups of switches simultaneously.
3. Maintaining a stable ITOM environment
Based on our backup policy, the configuration management tool backs up continuously to maintain these configurations.
First, ZAC provides us with visibility into each of our switches.
Next, our configuration management tool helps us group them. In this case, switches 1-20 make up group A, and switches 21-40 make up group B. For group A, the tool lets us create a configuration: VLAN A. We write a script and push it to group A using the tool. We can also schedule the scripts. For instance, we can write a customized script and schedule it to go live in group B on Monday at 10am.
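The grouping and scheduling described above might be modeled roughly as follows. This Python sketch only builds the list of pending jobs; it does not talk to any device. The switch names, the script file name, and the job format are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical inventory names; the text only says switches 1-20 and 21-40.
switches = [f"switch-{n}" for n in range(1, 41)]
groups = {"A": switches[:20], "B": switches[20:]}


def next_monday_10am(now):
    """Return the next Monday 10:00 that is strictly after `now`."""
    days_ahead = (0 - now.weekday()) % 7  # Monday is weekday 0
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=10, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def build_jobs(group, script_name, run_at):
    """Create one pending push job per switch in the named group."""
    return [
        {"switch": switch, "script": script_name, "run_at": run_at}
        for switch in groups[group]
    ]


# Example: schedule a hypothetical VLAN B script as seen from a Wednesday morning.
jobs = build_jobs("B", "vlan_b.cfg", next_monday_10am(datetime(2024, 5, 15, 9, 0)))
```

Keeping the group definition separate from the script makes it easy to reuse the same script for another group later.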
Finally, we back up these configurations using weekly and incremental backups based on our backup policy.
Case study 4:
Handling corrupt devices
Scenario: We need to handle a corrupt switch in the network.
1. Setting up a monitoring apparatus
Our primary monitoring tool monitors all the switches and works with our configuration tool to make automatic backups of configurations based on a set schedule.
2. Managing everyday operations
Our troubleshooting guidelines help engineers understand which backup is relevant to the situation and must be restored.
3. Maintaining a stable ITOM environment
Based on our ITOM policy, we maintain an algorithm that backs up different types of switches at different times of the day so we know which one to consider.
First, we must understand that even if the switch became corrupted today, we still need to restore it to the latest point so that we do not lose configurations. If we simply restore older data, the latest configurations will not be reflected in the network. Thus, we use our configuration management tool to automatically make daily and hourly backups.
Next, our technicians use troubleshooting guidelines to understand which backup is relevant to the situation.
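One plausible way to express "which backup is relevant" is to pick the most recent successful backup taken before the corruption occurred. The Python sketch below assumes that rule; the backup records and timestamps are invented for the example, and the actual guidelines may weigh other factors.

```python
from datetime import datetime

# Hypothetical backup records: (timestamp, succeeded)
backups = [
    (datetime(2024, 5, 15, 0, 0), True),    # daily backup
    (datetime(2024, 5, 15, 13, 0), True),   # hourly backup
    (datetime(2024, 5, 15, 14, 0), False),  # failed backup, not usable
    (datetime(2024, 5, 15, 15, 0), True),   # taken after the corruption
]


def pick_restore_point(backups, corrupted_at):
    """Return the latest successful backup taken before `corrupted_at`, or None."""
    candidates = [taken for taken, ok in backups if ok and taken < corrupted_at]
    return max(candidates) if candidates else None


restore_point = pick_restore_point(backups, datetime(2024, 5, 15, 14, 30))
```

Backups taken after the corruption are excluded because they may already contain the damaged configuration.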
Finally, based on our backup policy, we have an algorithm that backs up access switches and port switches at different times of the day. We make backups of all the devices in both the primary DC and the DR DC. If a backup fails for a device, we get an automated support ticket through configured alerts. We immediately fix the issue and restore the backup. For example, if the backup scheduled for 5pm fails, we will receive an email at 5:01pm. With manual intervention, we can fix it by 5:15pm. We perform such restorations based on our ITOM policy.
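The staggered schedule and the automated ticket for a failed backup could look something like this in Python. The backup hours per switch type, the device names, and the ticket fields are hypothetical; the text does not specify them.

```python
# Hypothetical backup windows per switch type (24-hour clock).
BACKUP_HOUR = {"access": 17, "port": 21}


def tickets_for_failures(results):
    """Build one ticket per failed backup.

    `results` is a list of (device, switch_type, succeeded) tuples.
    """
    return [
        {
            "device": device,
            "summary": f"Backup scheduled for {BACKUP_HOUR[switch_type]}:00 failed on {device}",
        }
        for device, switch_type, ok in results
        if not ok
    ]


# Example run: one access switch failed its backup.
results = [
    ("sw-access-01", "access", True),
    ("sw-access-02", "access", False),
    ("sw-port-01", "port", True),
]
tickets = tickets_for_failures(results)
```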
Case study 5:
Managing DC capabilities
Scenario: One of our DCs needs more servers. Do we have enough server space?
1. Setting up a monitoring apparatus
ZAC acquires the necessary data regarding all the components in the DC.
2. Managing everyday operations
Following the guidelines, our IT leaders use predictive analytics to anticipate when a particular facility will run out of capacity.
3. Maintaining a stable ITOM environment
With capacity planning, we make projections and perform regular audits of our DCs in order to have a comprehensive overview of their capabilities.
First, let us assume one of our DCs needs 200 more servers. Using ZAC, we learn how many servers are live already and how many ports are being used. If there are only enough ports for 150 more servers in this DC, ZAC will let us know.
Next, our IT leaders use predictive analytics as mentioned in our troubleshooting guidelines to anticipate when a DC will run out of capacity. In this scenario, we can simply procure more switches for the additional 50 servers needed.
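The arithmetic in this scenario can be written out as a short Python sketch. It assumes one switch port per server and a 48-port switch model; both are illustrative assumptions, not figures from the scenario.

```python
import math


def capacity_gap(servers_needed, free_ports, ports_per_switch):
    """Return (shortfall in ports, number of switches to procure)."""
    shortfall = max(0, servers_needed - free_ports)
    switches_to_buy = math.ceil(shortfall / ports_per_switch)
    return shortfall, switches_to_buy


# 200 servers requested, ports available for 150, assumed 48-port switches.
shortfall, switches_to_buy = capacity_gap(200, 150, 48)
print(f"Short by {shortfall} ports; procure {switches_to_buy} switch(es)")
```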
Finally, with capacity planning and predictive analytics, we can make sure we are always aware of our capacity and do not need drastic procurement to manage it.
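As a very simplified stand-in for predictive analytics, a straight-line projection of past usage gives a first estimate of when a facility fills up. The sketch below uses a plain average growth rate, which is an assumption chosen for brevity rather than the method actually used; the usage figures are invented.

```python
import math


def months_until_full(monthly_usage, capacity):
    """Estimate months until `capacity` is reached, assuming linear growth.

    `monthly_usage` is a list of used-port counts, oldest first.
    Returns None when usage is flat or shrinking.
    """
    if len(monthly_usage) < 2:
        raise ValueError("need at least two data points")
    growth = (monthly_usage[-1] - monthly_usage[0]) / (len(monthly_usage) - 1)
    if growth <= 0:
        return None
    remaining = capacity - monthly_usage[-1]
    return max(0, math.ceil(remaining / growth))


# Hypothetical history: usage grows by 40 ports a month toward 1,000 ports.
estimate = months_until_full([600, 640, 680, 720], 1000)
```

A projection like this is only as good as the assumption that past growth continues, which is why it is paired with regular audits.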
Case study 6:
Handling ITOM changes
Scenario: We need to provide firewall access for groups and limit access for individuals according to specific situations.
1. Setting up a monitoring apparatus
Our log analysis and monitoring tool monitors firewall logs, generates reports, and helps us understand the firewall activity.
2. Managing everyday operations
A change control form hosted on a Creator application tracks the change from beginning to end.
3. Maintaining a stable ITOM environment
The ITOM support library maintains all the information regarding network changes for future reference.
First, our log monitoring tool analyzes the logs from our firewalls and generates real-time alert notifications and security and bandwidth reports. It analyzes syslog messages and monitors firewall activity. The tool's features allow us to spot duplicate firewall rules, then cut them down or simplify them.
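Spotting duplicate firewall rules amounts to finding rules that match on every field that matters. The Python sketch below assumes a simplified rule format with source, destination, port, and action; real firewall rules carry more fields, and the tool's own logic is not documented here.

```python
def find_duplicate_rules(rules):
    """Return (first_index, duplicate_index) pairs for rules with identical fields."""
    seen = {}
    duplicates = []
    for index, rule in enumerate(rules):
        key = (rule["src"], rule["dst"], rule["port"], rule["action"])
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


# Hypothetical rule set using documentation address ranges.
rules = [
    {"src": "192.0.2.0/24", "dst": "198.51.100.5", "port": 443, "action": "allow"},
    {"src": "192.0.2.0/24", "dst": "198.51.100.5", "port": 22, "action": "deny"},
    {"src": "192.0.2.0/24", "dst": "198.51.100.5", "port": 443, "action": "allow"},
]
duplicate_pairs = find_duplicate_rules(rules)
```

Each pair points back to the first occurrence, so the later copy is the one to remove.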
Next, once we spot firewall activity and the NOC team decides to change the firewall access, it raises a request through the change control form hosted on a Creator application used for overall management. The change control form hosts all the data needed for the respective manager to approve the change and for the engineer to execute the change. This allows the change to be tracked all the way through completion and review. The data includes the location of the change request, the type of change request, the proposed duration of the change, the backup plan, and the person responsible for implementing the change.
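The fields listed above map naturally onto a small record type. The following Python sketch shows one way to model a change request and track it through to review; the stage names, the class, and the sample values are assumptions and do not describe the actual form.

```python
from dataclasses import dataclass, field

# Assumed lifecycle stages; the text only says the change is tracked through completion and review.
STAGES = ["requested", "approved", "executed", "reviewed"]


@dataclass
class ChangeRequest:
    location: str
    change_type: str
    proposed_duration_minutes: int
    backup_plan: str
    implementer: str
    history: list = field(default_factory=lambda: ["requested"])

    def advance(self):
        """Move the request to the next stage and return the new stage name."""
        current = STAGES.index(self.history[-1])
        if current + 1 >= len(STAGES):
            raise ValueError("change request has already been reviewed")
        self.history.append(STAGES[current + 1])
        return self.history[-1]


# Hypothetical request for a firewall access change.
request = ChangeRequest(
    location="Primary DC",
    change_type="firewall access",
    proposed_duration_minutes=30,
    backup_plan="Restore previous rule set",
    implementer="NOC engineer on shift",
)
request.advance()  # manager approves
request.advance()  # engineer executes
```

Keeping the full history on the record is what allows the change to be audited afterward.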
Finally, once the change request is fulfilled, the details regarding the change are maintained in the ITOM support library.
Case study 7:
Managing DC security
Scenario: An employee wants logical access to a DC.
1. Setting up a monitoring apparatus
ZAC has a database of all logical access that can be given to our employees. We check the database to see if the employee needs the access.
2. Managing everyday operations
An access control form hosted on a dedicated application tracks the access request from beginning to end. An engineer reviews the request and provides the required access by creating an instance in our password management tool.
3. Maintaining a stable ITOM environment
Our ITOM support library has guidelines on how to ensure the given access does not turn into a security issue. Based on these guidelines, we constantly monitor the logs for suspicious activity.
First, our security operations center (SOC) reviews the activity logs across the DCs. Its monitoring tools detect anomalies and suspicious activity.
Next, the SOC team works with the Zorro team to ensure DC security.
Finally, they use the security guidelines in the ITOM support library to analyze logs and events.