Service Threat Engineering: Taking a Page From Site Reliability Engineering

October 12, 2022

DZone article by Jason Bloomberg

Cloud-native computing extends well past Kubernetes-based infrastructure to roll up many modern best-practice approaches to building, running, and leveraging software assets at scale. The cloud-native approach then extends these practices beyond the cloud to the entire IT landscape.

Included in this list of best practices are ones that fall into the category of site reliability engineering (SRE). At the core of the practice of SRE is a modern approach to managing the risks inherent in running complex, dynamic software deployments – risks like downtime, slowdowns, and the like.

Following the cloud-native approach, we should extend these practices to all software landscape risks, including cybersecurity risks.

What, then, might it look like to apply SRE principles beyond their traditional focus on reliability to the full breadth of cybersecurity risk?

Error Budgets: The Key to Cloud Native SRE

To tie SRE and cybersecurity together, we need a bit of background, starting with Service Level Objectives.

The Service Level Objective (SLO) for a site, system, or service (collectively ‘service’) is a precise numerical target for any reliability dimension an organization wants to measure for a given user journey.

For example, an SLO might quantify the availability of a service, the latency or the freshness of the information provided to users at the user interface, or other key performance metrics important to the business.

Based upon this SLO, the ops team and its stakeholders can make fact-based judgments about whether to increase a service’s reliability (and hence, its cost) or lower its reliability and cost to increase the speed of development of the applications providing the service.

Instead of targeting perfection – SLOs of 100% that reflect no issues – the real question is just how far short of perfect reliability you should aim for. We call this quantity the error budget.

The error budget represents the number of allowable errors in a given time window that results from an SLO target of less than 100%. In other words, this budget represents the total number of errors a particular service can accumulate over time before users become dissatisfied with the service.

Most importantly, it should never be the operator’s goal to entirely eliminate reliability issues because such an approach would both be too costly and take too long – thus impacting the ability of the organization to deploy software quickly and run dynamic software at scale (both of which are core cloud native practices).

Instead, the operator should maintain an optimal balance between cost, speed, and reliability. Error budgets quantify this balance.

Read the entire article here.