DataOps Resiliency: Tracking Down Toxic Workloads

May 16, 2023

Intellyx BrainBlog for Unravel Data by Jason Bloomberg, Managing Partner, Intellyx

Part 4 of the Demystifying Data Observability Series for Unravel Data

In the first three articles in this four-post series, my colleague Jason English and I explored DataOps observability, the connection between DevOps and DataOps, and data-centric FinOps best practices.

In this concluding article in the series, I’ll explore DataOps resiliency – not simply how to prevent data-related problems, but also how to recover from them quickly, ideally without impacting the business and its customers.

Observability is essential for any kind of IT resiliency – you can’t fix what you can’t see – and DataOps is no exception. Failures can occur anywhere in the stack, from the applications on down to the hardware. Understanding the root causes of such failures is the first step to fixing, or ideally preventing, them.

The same sorts of resiliency problems that impact the IT environment at large can certainly impact the data estate. Even so, traditional observability and incident management tools don’t address specific problems unique to the world of data processing.

In particular, DataOps resiliency must address the problem of toxic workloads.

Understanding Toxic Workloads

Toxic data workloads are as old as relational database management systems (RDBMSs), if not older. Anyone who works with SQL on large databases knows there are some queries that will cause the RDBMS to slow dramatically or completely grind to a halt.

The simplest example: SELECT * FROM TRANSACTIONS where the TRANSACTIONS table has millions of rows. Oops! Your resultset also has millions of rows!

JOINs, of course, are more problematic, because they are difficult to construct, and it’s even more difficult to predict their behavior in databases with complex structures.

Such toxic workloads caused problems in the days of single on-premises databases. As organizations implemented data warehouses, the risks compounded, requiring increasing expertise from a scarce cadre of query-building experts.

Today we have data lakes as well as data warehouses, often running in the cloud where the meter is running all the time. Organizations also leverage streaming data, as well as complex data pipelines that mix different types of data in real time.

With all this innovation and complexity, the toxic workload problem hasn’t gone away. In fact, it has gotten worse, as the nuances of such workloads have expanded.

Read the entire BrainBlog here.