Service Level Indicator (SLI)

What is a Service Level Indicator (SLI)?

A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided. Common SLIs include metrics like latency, throughput, and error rate. SLIs are used to set and measure Service Level Objectives (SLOs).

In the world of DevOps, Service Level Indicators (SLIs) play a crucial role in ensuring the quality, reliability, and performance of software services. They are a fundamental part of Service Level Objectives (SLOs) and Service Level Agreements (SLAs), which are key components of site reliability engineering (SRE).

SLIs are often expressed as a percentage: the proportion of events, or of time, within a specified period for which the service behaved correctly. This article covers how SLIs are defined, their history, their use cases, and specific examples in the context of DevOps.

Definition of Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a carefully defined quantitative measure that represents the level of service provided by a service provider. In the context of DevOps and SRE, an SLI is a measure of a service's reliability and performance. It is a key component of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).

SLIs are typically expressed as a percentage: the number of good events divided by the total number of events within a specified time period. For example, an SLI could be defined as "the percentage of requests that return a successful response within 200 milliseconds".

Components of an SLI

An SLI is composed of two main components: an event and a success criterion. An event is an interaction between a user and the service, such as a request made to a web server. The success criterion is the condition that determines whether the event is considered successful or not.

For example, in the case of a web server, an event could be a request made to the server, and the success criterion could be that the server returns a response within 200 milliseconds. If the server meets this criterion, the event is considered successful, and it contributes to the SLI's percentage of successful events.
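
The pairing of an event with a success criterion can be sketched in Python as a ratio of good events to all events. The `Request` record, its `latency_ms` field, and the sample data are hypothetical and chosen for illustration; only the 200 millisecond threshold comes from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Request:
    """One event: a single request made to the service (hypothetical shape)."""
    latency_ms: float


def sli_percentage(events: Iterable[Request],
                   is_success: Callable[[Request], bool]) -> float:
    """Return the percentage of events that meet the success criterion."""
    events = list(events)
    if not events:
        raise ValueError("an SLI is undefined when there are no events")
    good = sum(1 for event in events if is_success(event))
    return 100.0 * good / len(events)


def within_200_ms(request: Request) -> bool:
    """Success criterion from the text: a response within 200 milliseconds."""
    return request.latency_ms <= 200


# Illustrative data only: three of the four requests meet the criterion.
sample = [Request(120), Request(180), Request(450), Request(90)]
print(sli_percentage(sample, within_200_ms))  # 75.0
```

Keeping the success criterion as a separate function means the same ratio calculation can be reused with a different criterion, such as a status-code check.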

Types of SLIs

There are several types of SLIs, each measuring a different aspect of a service's performance. The most common types of SLIs are availability, latency, throughput, and error rate.

Availability measures the proportion of time that a service is available for use. Latency measures the time it takes for a service to respond to a request. Throughput measures the number of requests that a service handles in a given time period. Error rate measures the proportion of requests that result in errors.

History of Service Level Indicators

The concept of Service Level Indicators (SLIs) originated in the field of telecommunications, where they were used to measure the quality of service provided by telecommunications networks. They were later adopted by the IT industry as a way to measure the performance and reliability of IT services.

The use of SLIs in the field of DevOps and site reliability engineering (SRE) was popularized by Google, which published a book on SRE in 2016 that included a detailed discussion of SLIs, SLOs, and SLAs. Since then, the use of SLIs has become a standard practice in the field of SRE.

SLIs in Telecommunications

In the field of telecommunications, SLIs were used to measure various aspects of the quality of service provided by telecommunications networks, such as the availability of the network, the latency of the network, and the error rate of the network.

These SLIs were typically defined in terms of the proportion of time that the network was available, the average time it took for a signal to travel from one point in the network to another, and the proportion of signals that were lost or corrupted during transmission.

SLIs in IT and DevOps

In the IT industry, SLIs were initially used to measure the performance and reliability of IT services, such as web servers and databases. They were typically defined in terms of the availability of the service, the latency of the service, the throughput of the service, and the error rate of the service.

With the advent of DevOps and SRE, the use of SLIs has become more sophisticated. They are now used to measure not only the performance and reliability of individual services, but also the performance and reliability of entire systems of services. In addition, they are used to inform the design and operation of these systems, helping to ensure that they meet the needs of their users.

Use Cases of Service Level Indicators

Service Level Indicators (SLIs) are used in a variety of contexts in the field of DevOps and site reliability engineering (SRE). They are used to measure the performance and reliability of services, to inform the design and operation of services, and to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs).

SLIs are also used to track the progress of improvement efforts, to identify areas where improvement is needed, and to provide a basis for communication between service providers and their users.

Measuring Performance and Reliability

One of the primary uses of SLIs is to measure the performance and reliability of services. By defining quantitative measures of performance and reliability, SLIs provide a clear and objective way to assess the quality of a service.

For example, an SLI could be used to measure the latency of a web server, providing a clear measure of the server's responsiveness. By tracking this SLI over time, the performance of the server can be monitored and any degradation in performance can be detected.
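
One way to track such an SLI over time is to group requests into fixed windows and compute the percentage per window. The sketch below assumes samples arrive as `(timestamp_s, latency_ms)` pairs; the 60 second window, the 200 millisecond threshold, and the 90% baseline are illustrative assumptions, not recommended values.

```python
from collections import defaultdict


def latency_sli_by_window(samples, window_s=60, threshold_ms=200):
    """Group (timestamp_s, latency_ms) samples into fixed windows and return
    {window_start_s: percentage of requests within the threshold}."""
    good = defaultdict(int)
    total = defaultdict(int)
    for timestamp_s, latency_ms in samples:
        window_start = int(timestamp_s // window_s) * window_s
        total[window_start] += 1
        if latency_ms <= threshold_ms:
            good[window_start] += 1
    return {start: 100.0 * good[start] / total[start] for start in sorted(total)}


def degraded_windows(sli_by_window, baseline_pct):
    """Return the window starts whose SLI fell below the chosen baseline."""
    return [start for start, pct in sli_by_window.items() if pct < baseline_pct]


# Illustrative data: the first minute is healthy, the second is not.
samples = [(0, 100), (10, 150), (30, 120), (50, 190),
           (65, 180), (70, 400), (90, 500), (110, 150)]
series = latency_sli_by_window(samples)
print(series)                        # {0: 100.0, 60: 50.0}
print(degraded_windows(series, 90))  # [60]
```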

Informing Design and Operation

SLIs are also used to inform the design and operation of services. By defining what constitutes a successful event, SLIs provide a clear target for the design and operation of a service.

For example, if an SLI is defined in terms of the latency of a web server, this provides a clear target for the design of the server. The server should be designed and operated in such a way as to meet this target as consistently as possible.
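
Checking a measured SLI against a target is simple arithmetic, sketched below. The function name, the event counts, and the 99% target are hypothetical values chosen for illustration.

```python
def check_against_target(good_events: int, total_events: int, target_pct: float):
    """Return (measured SLI percentage, whether it meets the target)."""
    if total_events <= 0:
        raise ValueError("an SLI is undefined when there are no events")
    measured_pct = 100.0 * good_events / total_events
    return measured_pct, measured_pct >= target_pct


# Illustrative numbers: 9,950 of 10,000 requests met the latency criterion.
measured, ok = check_against_target(9_950, 10_000, target_pct=99.0)
print(measured, ok)  # 99.5 True
```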

Examples of Service Level Indicators

There are many specific examples of Service Level Indicators (SLIs) in the field of DevOps and site reliability engineering (SRE). These examples illustrate the wide range of aspects of service performance and reliability that can be measured using SLIs.

It's important to note that the specific SLIs used in a given context will depend on the nature of the service and the needs of its users. Therefore, the examples provided here should be seen as illustrative, rather than prescriptive.

Availability SLI

An Availability SLI measures how much of the time a service is available for use. Because uptime is hard to observe directly, it is typically approximated by the proportion of requests that are successfully handled by the service.

For example, an Availability SLI for a web server might be defined as "the percentage of requests that return a successful response". This SLI measures the proportion of requests that the server handles successfully, which serves as a proxy for how often the server is available.
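
A request-based availability SLI could be computed from HTTP status codes roughly as follows. Treating every status below 500 as "successful" is an assumption made for this sketch; a real service may need to count some 4xx responses as failures, or exclude certain requests altogether.

```python
def availability_sli(status_codes):
    """Percentage of requests whose HTTP status is not a server error (5xx)."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("an SLI is undefined when there are no requests")
    successful = sum(1 for code in codes if code < 500)
    return 100.0 * successful / len(codes)


# Illustrative data: two of the ten responses are server errors.
codes = [200, 200, 503, 200, 404, 200, 500, 200, 200, 200]
print(availability_sli(codes))  # 80.0
```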

Latency SLI

A Latency SLI measures the time it takes for a service to respond to a request. This is typically defined in terms of the proportion of requests that are handled within a specified time limit.

For example, a Latency SLI for a web server might be defined as "the percentage of requests that return a response within 200 milliseconds". This SLI would measure the responsiveness of the server, providing a clear measure of its performance.

Throughput SLI

A Throughput SLI measures the number of requests that a service handles in a given time period. This is typically expressed as the number of requests per second.

For example, a Throughput SLI for a web server might be defined as "the number of requests per second that the server handles". Measured under load, this SLI shows how close the server is running to its capacity, which informs decisions about scaling.
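
Observed throughput can be derived from request timestamps, as in the sketch below. The function name and the sample timestamps are hypothetical. The calculation counts requests actually handled in a window; finding the maximum rate a server can sustain would require a load test rather than this calculation.

```python
def throughput_rps(timestamps_s, window_start_s, window_end_s):
    """Requests per second handled inside [window_start_s, window_end_s)."""
    duration_s = window_end_s - window_start_s
    if duration_s <= 0:
        raise ValueError("the window must have a positive duration")
    handled = sum(1 for t in timestamps_s if window_start_s <= t < window_end_s)
    return handled / duration_s


# Illustrative data: ten requests completed during a five second window.
timestamps = [0.1, 0.4, 0.9, 1.2, 1.5, 2.3, 2.8, 3.0, 3.6, 4.2]
print(throughput_rps(timestamps, 0, 5))  # 2.0
```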

Error Rate SLI

An Error Rate SLI measures the proportion of requests that result in errors. This is typically defined in terms of the proportion of requests that return an error response.

For example, an Error Rate SLI for a web server might be defined as "the percentage of requests that return an error response". This SLI would measure the reliability of the server, providing a clear measure of its quality.
