SLA, SLO, SLI: The KPI trio for exceptional service management
September 13 | 07 mins read
ITSM is constantly on a quest to provide better end-user experiences. Traditional frameworks, while valuable, often don't address the digital needs of the users and the business. They also lack methods for assessing service quality.
Born from Google's own need for robust and scalable systems, site reliability engineering (SRE) offers valuable principles that can be leveraged to perfect your ITSM practices.
One core tenet where SRE and ITSM intersect is the concept of SLAs. While SLAs have long been a part of ITSM, SRE introduces service level objectives (SLOs) and service level indicators (SLIs).
This trio of SLA, SLO, and SLI prioritizes shared goals between the IT service desk and employees, focuses on clear communication, and enhances the user experience. With a more granular, user-centric approach to measuring service performance, ITSM can deliver on the user experience it promises.
Let's dive into how the SRE principles of SLA, SLO and SLI are fundamentally reshaping IT organizations' approach to service delivery.
SLA
A service level agreement (SLA) is a formal agreement between the IT service desk and employees. It sets the ground rules for expected services by assigning responsibilities to the service desk teams, and it includes detailed escalation protocols for when agreed-upon service levels are not met.
For instance, the SLA for a cloud-based ERP system might specify a monthly uptime guarantee or define how quickly the services should be restored, that is, the resolution time for downtime incidents.
The SLA also outlines the proactive or reactive escalation actions that apply when the downtime threshold is exceeded. These actions could include notifying the relevant stakeholders about the downtime, bringing in subject-matter experts, raising the priority of incident tickets, or performing all of these at once.
This way, the SLA helps ensure transparency and accountability for the IT service desks and the employees throughout the service delivery process.
Be sure to check out our previous article on how SLAs work in tandem with OLAs to set the right IT service delivery expectations.
SLO
Service level objectives (SLOs), a component of the SLA, are carefully designed, concrete, numerical targets for the IT service desk teams. They define the desired performance level for a specific service, translating the promises of the SLA into actionable goals.
SLOs mark the threshold between acceptable and unacceptable service. Their targets are expressed through metrics such as uptime percentages, average resolution times for tickets, or employee satisfaction scores.
Continuing with the ERP system example, if the SLA specifies an uptime guarantee, the company might set the uptime SLO at 99.95% and the resolution-time SLO at 20 minutes for all downtime-related incident tickets.
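To make the 99.95% figure more tangible, an uptime target can be converted into the downtime it permits. The sketch below is illustrative only: the function name and the 30-day month are assumptions, not part of any SLA described here.

```python
# Hypothetical helper: convert an uptime target into permitted downtime.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_target_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted over `days` at the given target."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_target_pct) / 100.0


# A 99.95% uptime SLO over an assumed 30-day month
print(round(allowed_downtime_minutes(99.95), 2))  # 21.6
```

Under these assumptions, a 99.95% monthly uptime target leaves roughly 21.6 minutes of downtime before the target is missed.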
SLOs are mostly set internally to establish clear, measurable targets that help the IT team stay focused on its goals. These internal SLOs differ from those stated in the SLA in that they are more ambitious. For example, the company might set an internal SLO of 99.99% uptime and a 12-minute resolution time, exceeding the 99.95% uptime and 20-minute resolution time guaranteed in the SLA, to create a buffer for unforeseen circumstances. This buffer between the ambitious level (internal SLO) and the promised level (SLA) leaves room for minor errors without breaching the agreement. It is related to, but distinct from, the SRE notion of an error budget, which is the amount of unreliability an SLO itself permits (100% minus the SLO target).
This approach lets teams prioritize addressing issues before they become major problems, like an SLA breach.
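The size of the buffer between the internal SLO and the SLA can be estimated with simple arithmetic. This is a rough sketch using the example targets above; the helper name and the 30-day month are assumptions.

```python
# Hypothetical helper, assuming a 30-day month.
def downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given uptime percentage."""
    return days * 24 * 60 * (100.0 - uptime_pct) / 100.0


sla_allowance = downtime_minutes(99.95)        # promised level in the SLA
internal_allowance = downtime_minutes(99.99)   # stricter internal SLO
buffer_minutes = sla_allowance - internal_allowance

print(round(sla_allowance, 2))       # 21.6
print(round(internal_allowance, 2))  # 4.32
print(round(buffer_minutes, 2))      # 17.28
```

In this example, the team aims for no more than about 4.32 minutes of downtime per month, which leaves about 17.28 minutes of headroom before the SLA itself is at risk.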
SLI
Service level indicators (SLIs) are the measurable metrics used to track progress toward the SLOs. In other words, they measure how well the IT team's service complies with the preset SLOs.
SLIs are usually expressed as percentages, ranging from 0% (nothing is functioning) to 100% (everything is perfect).
In our ERP system example, where the SLA states an uptime guarantee and the SLO is set at 99.95% uptime, the SLI would be the actual measured uptime, perhaps 99.96%. The SLI for the resolution timeframe would be the actual resolution time of individual tickets, measured against the SLO target. For example, the SLI could be 17 minutes, which is within the 20-minute SLO target.
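As a rough illustration of how an SLI is measured and compared with its SLO, the sketch below uses the 99.95% uptime and 20-minute resolution targets from the SLO section. The outage durations, function names, and 30-day month are made up for the example.

```python
# Hypothetical sketch: measure SLIs and compare them with SLO targets.
def uptime_sli(outage_minutes: list[float], days: int = 30) -> float:
    """Measured uptime percentage over `days`, given each outage's duration."""
    total_minutes = days * 24 * 60
    return 100.0 * (total_minutes - sum(outage_minutes)) / total_minutes


def meets_uptime_slo(sli_pct: float, slo_pct: float) -> bool:
    """Uptime is on target when the measured value is at or above the SLO."""
    return sli_pct >= slo_pct


def meets_resolution_slo(actual_minutes: float, target_minutes: float) -> bool:
    """Resolution is on target when it takes no longer than the SLO allows."""
    return actual_minutes <= target_minutes


measured = uptime_sli([10, 7])  # two illustrative outages in the month
print(round(measured, 2))                # 99.96
print(meets_uptime_slo(measured, 99.95))  # True
print(meets_resolution_slo(17, 20))       # True
```

Note that the comparison direction differs by metric: a higher uptime SLI is better, while a lower resolution-time SLI is better.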
Not every metric should be tracked as an SLI. It is crucial to identify the metrics that directly impact employees and to track only those. For example, in an ERP system, tracking the number of user logins or the average user activity duration doesn't tell you much about the effectiveness of your service.
Instead, the SLIs could be the measurement of system availability during business hours, response and resolution times, or the number of internal SLO-breached tickets. By monitoring these metrics, the service desk teams can identify and address potential issues early on, preventing them from snowballing into major issues and escalations.
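One of the SLIs mentioned above, the number of tickets that breach the internal SLO, can be derived directly from ticket resolution times. The sketch below is only an illustration: the ticket data is invented, and the 12-minute threshold is the internal SLO from the earlier example.

```python
# Hypothetical sketch: derive ticket-based SLIs from resolution times.
INTERNAL_RESOLUTION_SLO_MINUTES = 12  # internal target from the example


def breached_tickets(resolution_minutes: list[float], target: float) -> int:
    """Count tickets whose resolution time exceeded the target."""
    return sum(1 for minutes in resolution_minutes if minutes > target)


def compliance_pct(resolution_minutes: list[float], target: float) -> float:
    """Percentage of tickets resolved within the target."""
    if not resolution_minutes:
        return 100.0  # assumption: no tickets means nothing was breached
    within = len(resolution_minutes) - breached_tickets(resolution_minutes, target)
    return 100.0 * within / len(resolution_minutes)


tickets = [9, 14, 11, 25, 12]  # illustrative resolution times in minutes
print(breached_tickets(tickets, INTERNAL_RESOLUTION_SLO_MINUTES))  # 2
print(compliance_pct(tickets, INTERNAL_RESOLUTION_SLO_MINUTES))    # 60.0
```

Expressing the result as a compliance percentage keeps the SLI on the familiar 0% to 100% scale.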
The feedback loop
Once you fully grasp what SLA, SLO and SLI are, you can deduce that they work in a feedback loop.
- The SLA defines the expectations for employees, and the SLOs are the individual targets within the SLA to be achieved. To analyze whether the team is hitting the SLOs, SLIs are measured.
- By analyzing SLI data, the team can identify areas for improvement and adjust its approach to reach the SLOs. If the team consistently misses SLOs and records low SLI values, it might be time to revisit the SLA and adjust expectations.
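The loop described above can be sketched as a simple periodic check that compares a measured SLI with both the internal SLO and the SLA. The function name, status labels, and thresholds here are assumptions for illustration, not a prescribed process.

```python
# Hypothetical sketch of the feedback loop for an uptime SLI.
def review_uptime(sli_pct: float, internal_slo_pct: float, sla_pct: float) -> str:
    """Classify a measured uptime against the internal SLO and the SLA."""
    if sli_pct < sla_pct:
        return "sla_breach"           # the promised level was missed: escalate
    if sli_pct < internal_slo_pct:
        return "internal_slo_missed"  # inside the buffer: investigate early
    return "on_target"


# Example targets from this article: internal SLO 99.99%, SLA 99.95%
for measured in (99.995, 99.96, 99.90):
    print(measured, review_uptime(measured, 99.99, 99.95))
```

A run of "internal_slo_missed" results is the early-warning signal, while repeated "sla_breach" results suggest the SLA itself may need to be revisited.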
Comparing SLA, SLO and SLI
Characteristics | SLA | SLO | SLI |
---|---|---|---|
Definition | Formal agreement between the IT service desk and the employees that defines the expected level of service. | Specific measurable target that defines the desired performance level for a service mentioned in the SLA. | Metrics used to measure the actual performance of the service desk. |
Scope | To set expectations for the employees. | To define the performance goals for the IT service teams to meet the SLA. | To measure and gain insights about the actual performance of the service desk. |
Degree of granularity | Broad and encompassing multiple SLOs. | Specific and focused on individual performance metrics. | Detailed and often numerous, providing granular data. |
Flexibility | More rigid, with proactive and reactive escalations. | The external SLOs, which are part of the SLA, are rigid. The internal SLOs are flexible, as the objectives can change based on the capacity of the service desk teams. | - |
Wrapping up
- The SLA is the overall agreement between the IT service desk and employees.
- SLOs are the measurable targets, including stricter internal ones, set to meet the SLA.
- SLIs are the actual measurements of performance against those targets.
By working together, this trio empowers IT service desk teams to set clear expectations for employees, identify potential issues before they impact service delivery, and keep services performing at their best.