Observability with Context: Telemetry, Time, Tracing, and Topology

November 17, 2020

Intellyx BrainBlog for StackState by Jason Bloomberg

What changed?

That’s the question ops personnel have been asking for decades whenever something goes wrong in the production IT environment.

Everything was working before, so the reasoning goes, and now it’s not. We have an incident. And to figure out what caused the incident – and hence, to have any idea how to fix it – we must know what changed.

There’s just one problem with this approach. What if everything is subject to change, all the time? Hunting for the one needle that matters in a veritable haystack of needles would hardly be an effective way to get to the bottom of an incident.

We can no longer look only at the individual events, the ‘what changed’ bits that gave us clues in the past. Instead, we must look at the big picture: how everything fits together. In other words, we need the context for such events.

Given the complex and dynamic nature of today’s production environments, however, we can’t wait around for this big picture to materialize. We must understand the context for each incident in real time if we have any hope of resolving issues promptly and cost-effectively.

Here’s how it’s done.

Read the entire article here.