Leveraging GitOps and Site Reliability Engineering to Manage Change at Scale

July 20, 2021

BrainBlog for Nobl9 by Jason Bloomberg

The enterprise IT challenge of the day: how to manage – and leverage – near-constant change?

In the previous BrainBlog post in this series, my colleague Jason English explained how important service-level objectives (SLOs) are to managing such change:

“SLOs support an iterative, DevOps delivery process that embraces constant change,” English wrote. “Continuous delivery of code to production is merged with continuous observability of the impact of each change in production, and the resulting SLIs can fulfill existing SLOs while helping to identify new SLOs for improvement.”

At the heart of this DevOps delivery process is CI/CD: the continuous integration and continuous deployment that results in working code ready for production.

Today, the ability to leverage change is a top business priority. SRE, leveraging SLOs and GitOps in cloud-native environments, is becoming the only way to deliver such change within the constraints of the business.

Deployment isn’t the end of the process, however. Releasing code is the missing step: putting new software in front of customers and end-users, while ensuring it meets the ongoing objectives of the business.

It is at the point of software release and thereafter that site reliability engineering (SRE) can leverage SLOs to balance these business needs with the technical measures of success that service level indicators (SLIs) represent.

As organizations’ software deployments mature to take advantage of constant change, site reliability engineers increasingly focus on Kubernetes-powered cloud-native environments. However, the massive scale and ephemerality of the operational environment require an end-to-end rethink of how to release software into production and operate it once it’s there.

Service-Level Objectives for Cloud-Native Computing

While most enterprises are currently in the midst of ramping up their Kubernetes deployments, certain industries are already looking ahead to the need for unprecedented scale.

On the one hand, this explosive growth in business demand for ephemerality and scale is driving the exceptionally rapid maturation of the Kubernetes ecosystem.

On the other hand, all this cutting-edge technology has to actually work. And that’s where cloud-native operations fits in.

Cloud-native computing takes the established ‘infrastructure as code’ principle and extends it to model-driven, configuration-based infrastructure. Cloud-native also leverages the shift-left, immutable infrastructure principle.

A model-driven, configuration-based approach to software deployment is necessary for achieving the goals of cloud-native computing, but it is not sufficient on its own to ensure that deployed software can handle the scale and ephemerality of a cloud-native environment.

Software teams must extend such configurability to production environments in a way that expects and deals with ongoing change in production. To this end, various ‘shift-right’ activities including canary deployments, blue/green rollouts, automated rollbacks, chaos engineering, and other techniques are necessary to both deal with and take advantage of ongoing, often unpredictable change in production environments.

‘Shift-right’ (not to be confused with ‘shift-left’) refers to the fact that these actions take place after the release of software – in the live, production environment. The reality of modern, cloud-native computing is that change is so constant that the core of testing – making sure the software meets the business need – must take place in production.
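To make one of these shift-right techniques concrete, here is a minimal sketch of the kind of check that might sit behind a canary deployment with automated rollback. It compares the canary's error rate against the stable baseline and decides whether to promote, roll back, or keep waiting for traffic. The function name, the thresholds (`min_requests`, `max_ratio`, `floor`), and the decision labels are all hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"
    WAIT = "wait"


def canary_decision(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    min_requests: int = 500,   # hypothetical: minimum canary traffic before judging
    max_ratio: float = 1.5,    # hypothetical: tolerated canary/baseline error ratio
    floor: float = 0.001,      # hypothetical: absolute error rate always tolerated
) -> Decision:
    """Decide what to do with a canary release based on observed error rates."""
    # Without enough traffic on either side, any comparison is noise.
    if canary_total < min_requests or baseline_total <= 0:
        return Decision.WAIT

    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total

    # The floor keeps a near-zero baseline from forcing a rollback
    # on a single failed canary request.
    threshold = max(baseline_rate * max_ratio, floor)

    if canary_rate > threshold:
        return Decision.ROLLBACK
    return Decision.PROMOTE
```

In practice the inputs would come from whatever SLIs the team already collects in production, and a real rollout controller would typically evaluate the check repeatedly as traffic shifts rather than once.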

SLOs are essential in such shift-right scenarios, as the balancing act between performance and user experience takes place directly in front of users. The DevOps and SRE teams must manage these factors in real time in order to keep the software in production within the error budget on an ongoing basis.
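The arithmetic behind an error budget is simple enough to sketch. Assuming a request-based availability SLO (the `SLO` class and `error_budget_remaining` function below are illustrative names, not part of any product), the budget is the share of requests the SLO allows to fail, and the remaining budget is whatever part of that allowance has not yet been spent:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical request-based availability objective."""
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget_remaining(slo: SLO, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means the budget is untouched, 0.0 means it is exhausted,
    and a negative value means the SLO has been violated.
    """
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent

    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed == 0 else float("-inf")

    return 1.0 - failed / allowed_failures


# Usage: a 99.9% target over one million requests allows about 1,000 failures.
slo = SLO(target=0.999)
remaining = error_budget_remaining(slo, total=1_000_000, failed=250)
```

With 250 failures against an allowance of roughly 1,000, about three quarters of the budget remains. A team might use a number like this to decide whether to keep releasing or to slow down and focus on reliability.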
