Moving the goal: How the edge services observability game has changed

An Intellyx BrainBlog by Jason English, for Hydrolix

We’ve been talking about ‘the edge’ for quite a while now. Before microservices, before cloud, before virtual networks, and even before the Y2K panic, there was a notion of edge computing as a means to higher performance. 

Early-to-market edge computing services like Akamai offered access to geographically closer points of presence with lower latency, so businesses could cache website content, ‘hot’ data, and computational functions closer to customers for a more responsive experience. That basic value proposition hasn’t changed.

However, thanks to the proliferation of hybrid cloud environments, streaming data, and smart devices, the modern edge services game is being played on an entirely different field, one without boundaries or standardized rules. It almost seems futile to set goals for improved application performance or availability across multiple content delivery networks (CDNs), when there are so many dimensions of observability data to keep track of.

The log jam of too many CDN and edge data sources

Merely looking at average page load and query response times with traditional monitoring tools provides little insight into the root causes of performance problems. Modern observability platforms arose to resolve this challenge, providing deeper performance metrics and trend lines across servers in the datacenter and cloud infrastructures.

A data log jam began developing as organizations leaned more heavily on observability and SIEM platforms to gather insights from all of the distributed services across the IT estate. Since observability software vendors weren’t really set up to efficiently process and store huge volumes of event data, many end customers saw their platform storage costs multiply rapidly.

Cloud data warehouses and data lakes scaled up to answer the data volume problem with lower incremental costs, to great commercial success. By filtering event data with time-based intervals, and prioritizing the storage of log data into hot, warm, and cold availability tiers, observability efforts reached a better balance of system health visibility versus cloud cost.

Alongside innovations in microservices and cloud native architectures, improved system-wide observability helped software delivery teams make incredible strides toward improving scalability and performance. As developers and SREs started homing in on performance bottlenecks, they looked to remote edge networks and CDNs such as CloudFront and Fastly to bring systems and data closer to end users.

The benefits of edge services for responsive customer experiences come with a significant side effect: the difficulty of dealing with a massive flood of inconsistent log data and machine-to-machine events from multiple CDNs and endpoints.

Why edge observability data is different

When you look at a monitoring dashboard for conventional applications, you see that most of the incoming log data is associated at a system level—coming from an addressable Linux or Windows OS on a server with a static IP address, or encapsulated in a container.

Adding edge services takes log data complexity to a whole new level for several reasons:

High cardinality. Edge services and CDNs generate an enormous number of unique values, thanks to the ephemeral nature of elastically deployed and decommissioned workloads running on a vast array of microservices and endpoints. Thousands of Kubernetes clusters may come and go within a minute, each with its own unique ID, namespace, configuration, and source, and each may be running on temporary cloud infrastructure or remote devices.

Machine-to-machine transactions. A human request is only the first moment of a long relay race of edge services, event feeds, and core systems talking to each other before a result shows up in the user’s app. A single workflow may take several hops across disparate systems, with each hop generating unique interaction data between systems and exposing new opportunities for malicious bots or signal interception. Tracking a complete workflow requires correlating events across chained transactions.

Standards mismatch. When an observability platform takes in data from millions of unique nodes on multiple CDNs and cloud instances, there are bound to be incompatibilities between log files. Different cloud service providers, software vendors, and open source tools generate records with their own naming conventions, labeling standards, and attribute formats. Even if those formats are not under your control, it’s still your responsibility to make sense of the data; the sketch after this list shows one way to normalize and correlate such records.

Performance and cost tradeoffs. Enterprises need to track service level objectives (SLOs) and indicators (SLIs) for edge services infrastructure using log data that is as fresh and complete as possible. If logs are heavily sampled or pushed off to cold data tiers to save cloud storage and data processing costs, the result is an incomplete picture of performance. Reliably delivering on edge services SLOs requires complete, near real-time data, as well as ‘hot’ historical data reaching back a year or more that is ready for measurement and analysis.
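To make the correlation and standards-mismatch points concrete, here is a minimal Python sketch. The record layouts, field names, and shared request ID are hypothetical assumptions for illustration, not the actual log schemas of any CDN vendor or of Hydrolix; a real pipeline would handle far more formats and far greater volume at ingest time.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw records from two CDNs with mismatched field names, units, and timestamp formats.
cdn_a_logs = [
    {"reqId": "abc-123", "ts": "2024-05-01T10:00:00Z", "edge_pop": "CDG", "status": 200, "ttfb_ms": 42},
    {"reqId": "abc-123", "ts": "2024-05-01T10:00:01Z", "edge_pop": "CDG", "status": 200, "ttfb_ms": 38},
]
cdn_b_logs = [
    {"RequestID": "abc-123", "EdgeStartTimestamp": 1714557602.5, "colo": "PAR",
     "EdgeResponseStatus": 200, "OriginResponseTimeNs": 55_000_000},
]

def normalize_a(rec):
    """Map CDN A's layout (ISO timestamps, millisecond latencies) onto a common schema."""
    return {
        "request_id": rec["reqId"],
        "timestamp": datetime.fromisoformat(rec["ts"].replace("Z", "+00:00")),
        "pop": rec["edge_pop"],
        "status": rec["status"],
        "latency_ms": float(rec["ttfb_ms"]),
        "source": "cdn_a",
    }

def normalize_b(rec):
    """Map CDN B's layout (epoch seconds, nanosecond latencies) onto the same schema."""
    return {
        "request_id": rec["RequestID"],
        "timestamp": datetime.fromtimestamp(rec["EdgeStartTimestamp"], tz=timezone.utc),
        "pop": rec["colo"],
        "status": rec["EdgeResponseStatus"],
        "latency_ms": rec["OriginResponseTimeNs"] / 1_000_000,
        "source": "cdn_b",
    }

# Normalize both feeds, then correlate every hop of a workflow by its shared request ID.
events = [normalize_a(r) for r in cdn_a_logs] + [normalize_b(r) for r in cdn_b_logs]
workflows = defaultdict(list)
for event in events:
    workflows[event["request_id"]].append(event)

for request_id, hops in workflows.items():
    hops.sort(key=lambda e: e["timestamp"])
    sources = sorted({h["source"] for h in hops})
    total_latency_ms = sum(h["latency_ms"] for h in hops)
    print(f"{request_id}: {len(hops)} hops across {sources}, {total_latency_ms:.1f} ms cumulative latency")
```

In practice this kind of normalization happens at ingest, so that downstream queries and dashboards see one consistent schema regardless of which CDN or endpoint emitted the record.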

Supporting multiple CDN instances at TF1

Leading French broadcaster and streaming media provider TF1 developed their own internal CDN capacity and recently adopted a commercial CDN to improve their media distribution over multiple ISPs to smart TVs, browsers, and mobile device endpoints.

Multiple CDNs improved video streaming performance for a larger audience of customers, but also created a mismatch between logging systems: to track performance objectives, IT teams had to manually unpack archives, browse log data in S3 buckets, and transform raw log files stored in different places.

The company turned to Hydrolix, whose streaming data lake is purpose-built to handle the high-cardinality, high-volume log data emanating from multi-CDN environments.

Using Hydrolix, TF1 automated data transformation and compression, and set up search and monitoring for near-real-time logs in a 15-month ‘hot’ data storage tier. The team also created regular query intervals of 5 hours or 5 days for more consolidated trend reporting, enabling them to right-size cloud data costs according to their needs.
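As a rough illustration of that query-interval idea (not of Hydrolix’s actual query interface, which is not shown here), the short Python sketch below buckets normalized log events into fixed windows such as 5 hours or 5 days and computes a per-window average latency for trend reporting; the event shape and values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical normalized log events, as produced by the earlier normalization sketch.
events = [
    {"timestamp": datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc), "latency_ms": 42.0},
    {"timestamp": datetime(2024, 5, 1, 14, 30, tzinfo=timezone.utc), "latency_ms": 95.0},
    {"timestamp": datetime(2024, 5, 3, 8, 15, tzinfo=timezone.utc), "latency_ms": 61.0},
]

def trend_report(events, interval: timedelta):
    """Bucket events into fixed intervals (e.g. 5 hours or 5 days) and average latency per bucket."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    buckets = defaultdict(list)
    for e in events:
        # Floor each timestamp to the start of its interval.
        intervals_since_epoch = (e["timestamp"] - epoch) // interval
        buckets[epoch + intervals_since_epoch * interval].append(e["latency_ms"])
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Near-real-time trend view (5-hour windows) versus consolidated reporting (5-day windows).
print(trend_report(events, timedelta(hours=5)))
print(trend_report(events, timedelta(days=5)))
```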

“It’s really a visibility matter,” said Simon LaRoque, Project Manager Streaming, TF1, in a recent interview. “I have a number of CDN vendors, but I also have peering capacity, transit. Then I also have the ISP networks involved. So the better visibility I have, the simpler it is to pinpoint the root cause of an issue.” 

The Intellyx Take

Combining different edge services to meet performance and agility objectives requires us to change the way we think about handling high volume and highly variable log data.

Performance KPIs and objectives are only useful if the data that feeds them is valid, timely, and rich enough for near-real-time queries and analytics. Don’t skimp on a complete view of multi-CDN data when modern solutions can affordably keep cloud and storage costs from spiraling out of control.

As application architectures evolve toward moving certain high-performance workloads to multiple CDNs in order to improve customer proximity and responsiveness, enterprises need near-real-time observability into the inner workings of edge services as they interact with the real world. 

©2024 Intellyx B.V. Intellyx is editorially responsible for this document. At the time of writing, Hydrolix is an Intellyx customer. None of the other organizations mentioned here are Intellyx customers. No AI bots were used to write this content. Image sources: Adobe Image Express.
