The Business Case for Observability and Site Reliability Engineering

BrainBlog for Moogsoft by Charles Araujo

Unlike traditional IT Ops, the role of the SRE isn’t simply focused on finding and solving technical problems. The big win for today’s SREs is supporting the organization’s strategic innovation initiatives. With the appropriate observability capabilities, it’s possible to quantify the value that software infrastructure contributes to this innovation effort.

Throughout this series, we’ve been exploring the interplay between the discipline of Site Reliability Engineering (SRE), the role of the Site Reliability Engineer (also SRE), and observability. We examined the meaning of adopting an SRE discipline, how observability differs from monitoring, and the role of automation in adopting observability.

But all of that is really a preamble to the heart of the matter: why should SREs adopt observability?

The answer is found in the long, tumultuous history of the IT function. As much as it’s tempting to see cloud-native, DevOps, and observability as wholly new endeavors, the reality is that they are the latest chapter in a story that seems to repeat itself incessantly.

Every so often, IT goes through a period of rapid innovation. And almost as fast, the need for business-critical systems to be reliable, available, and performant begins to tamp down that innovation, until no innovation is happening at all. Then the cycle starts again.

So, however fabulous the innovations that cloud-native and DevOps approaches engender may be, they will be short-lived. The need to maintain reliability, availability, and performance will eventually crush the culture of innovation that organizations seek to sustain.

Unless, that is, you do something to break this cycle.

That something is why observability and the role of the SRE represent such a dramatic shift, and such an opportunity, for organizations that seize on them.

Why the Role of the SRE Is Different

As we’ve covered extensively throughout this series, one of the most striking differences between the role of the SRE and the traditional role of IT Ops is the focus on the totality of the service experience, rather than on the mere maintenance of an operational state.

The SRE discipline and role grew out of a recognition that managing reliability, availability, and performance needed to happen from a service perspective rather than a systems perspective.

Moreover, the role was made a first-class citizen in the end-to-end continuous integration and deployment process. As a result, SREs are much more comfortable playing an active role throughout the entire application development and deployment lifecycle — and this fact is critical to the essential role that they and observability play in sustaining an organization’s innovation culture.

The traditional role of IT Ops looked at new applications or changes to existing applications as an operational burden. Both introduced change, and change inevitably impaired the ability of IT Ops to maintain a ready state.

While the role of the SRE is ostensibly the same, its strategic and integrated posture fundamentally shifts the focus.

Being able to positively influence reliability, availability, and performance throughout the development process transforms the cultural posture of the entire organization. Moreover, the focus on the total end-user experience, rather than a systems-centric view of the operational state, gives the SRE a different, more strategically aligned viewpoint.

Read the entire BrainBlog here.
