Site Reliability Engineering, Observability, and the Tradeoffs of Modern Software

December 1, 2021

BrainBlog for Moogsoft by Jason Bloomberg

This blog post defines SRE by explaining SLOs and error budgets, highlighting the innovation vs. reliability tradeoff.

The most striking difference between modern enterprise software development and the practices of the past is the growing emphasis on deployment velocity.

Where a monthly (or slower) release cadence was once considered routine, today’s enterprises find that the fast pace of innovation is pushing a growing share of their software initiatives toward daily or even hourly releases.

Such velocity requires a rethink of every aspect of the software development lifecycle – in particular, operations. Operators must keep availability and reliability as their priorities even as deployment cadences accelerate across the IT landscape.

Operators must therefore weigh the tradeoffs between reliability and availability on the one hand and deployment velocity on the other – while simultaneously managing costs.

Measurable best practices for managing such tradeoffs are at the heart of site reliability engineering (SRE) and its focus on error budgets, which represent those tradeoffs.

Defining Service-Level Objectives (SLOs)

Increasing deployment velocity requires IT leadership to shift its attention toward a business expectation model, one in which the business defines targets that tie IT delivery models to business outcomes.

To put this expectation model into practice, organizations need a way to measure how well a service meets users’ expectations along any particular dimension of reliability: what we call the service-level indicator, or SLI. The SLI is the proportion of valid events that were good, expressed as a percentage.

In this context, ‘good’ can refer to availability, latency, the freshness of the information provided to users at the user interface, or other key performance metrics that are important to the business. For example, an SLI might report that 99.9% of valid requests for the page index.html were successful (returned a 200 ‘OK’ HTTP code).
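The SLI definition above (good events divided by valid events, as a percentage) can be sketched in a few lines of Python. This is only an illustration: the `Request` record, the `availability_sli` function, and the sample data are hypothetical names invented here, and 'valid' is assumed to mean any request for the page in question.

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str    # requested page (hypothetical field name)
    status: int  # HTTP status code returned


def availability_sli(requests, path="/index.html"):
    """Percentage of valid requests for `path` that returned HTTP 200."""
    # Assumption: an event is 'valid' if it is a request for `path`.
    valid = [r for r in requests if r.path == path]
    if not valid:
        return None  # no valid events, so the SLI is undefined
    # Assumption: an event is 'good' if it returned a 200 'OK' code.
    good = sum(1 for r in valid if r.status == 200)
    return 100.0 * good / len(valid)


# 999 successes and 1 failure for index.html; the other page is not a valid event.
sample = (
    [Request("/index.html", 200)] * 999
    + [Request("/index.html", 500)]
    + [Request("/about.html", 404)]
)
print(availability_sli(sample))  # 99.9
```

Changing what counts as 'valid' or 'good' changes the number, which is why those classification decisions have to be settled before the indicator is useful.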

Each SLI provides a guideline for each dimension of reliability an organization wants to observe and measure for a given user journey. Once the ops team has specified the SLIs important to the business, they must decide how to measure each one and which events count as valid, and then classify which of those events are ‘good.’

The Service-Level Objective (SLO) for a system, in turn, is a precise numerical target for any such dimension – the availability of the system, for example. To define your SLO, start with your SLIs. Make sure each one has an event and a success criterion. The SLO then specifies the target proportion of those events that must be good over a given time window.

For example, your SLO might state that 99.9% of the valid requests for the page index.html over the last 30 days returned a 200 ‘OK’ code in 150 milliseconds or less.
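As a rough sketch of how such an SLO could be checked, the Python below evaluates the example target (99.9% over 30 days, 200 code, 150 milliseconds or less) against a list of request events. The `Event` record, the `slo_report` function, and the sample data are hypothetical, and the events are assumed to be pre-filtered to requests for index.html. The error budget figure uses the common SRE convention that the budget is the share of events the target allows to be bad (1 minus the SLO); the text above does not spell that formula out, so treat it as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    status: int        # HTTP status code returned
    latency_ms: float  # time taken to serve the request


def slo_report(events, now, target=0.999, window_days=30, max_latency_ms=150.0):
    """Evaluate a request-based SLO over a trailing time window."""
    window_start = now - timedelta(days=window_days)
    # Valid events: requests that fall inside the trailing window.
    valid = [e for e in events if window_start <= e.timestamp <= now]
    if not valid:
        return None  # nothing to measure in this window
    # Good events: a 200 code returned within the latency threshold.
    good = sum(
        1 for e in valid if e.status == 200 and e.latency_ms <= max_latency_ms
    )
    sli = good / len(valid)
    return {
        "sli": sli,
        "slo_met": sli >= target,
        "bad_events": len(valid) - good,
        # Assumed convention: budget = (1 - target) of the valid events.
        "error_budget_events": (1 - target) * len(valid),
    }


now = datetime(2021, 12, 1)
events = (
    [Event(now - timedelta(days=1), 200, 80.0) for _ in range(9990)]   # good
    + [Event(now - timedelta(days=2), 200, 400.0) for _ in range(5)]   # too slow
    + [Event(now - timedelta(days=3), 500, 20.0) for _ in range(5)]    # errors
    + [Event(now - timedelta(days=40), 500, 20.0)]                     # outside window
)
report = slo_report(events, now)
print(report["slo_met"], report["bad_events"])
```

In this sample, 10 of 10,000 in-window requests are bad, which is about what a 99.9% target allows; one more failure inside the window would miss the objective.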

Read the entire BrainBlog here.