Thought experiment: let’s say your app gets a message from somewhere, perhaps from another app, but you don’t know from where. The message contains the number 47 but no other information. What should your app do with the message?
The answer: nothing. There’s no way for your app to make any sense out of a single datum with no context, no additional information or metadata about the datum itself.
Now, let’s scale up this thought experiment to a data lake. There are a few common definitions of data lake, but perhaps the most straightforward is a storage repository that holds a vast amount of raw data in its native format until it is needed.
True, there may be metadata in a data lake, thrown in along with the data they describe – but there is no commonality among such metadata, and furthermore, the context of the information in the lake is likely to be lost, just as a bucket of water poured into a real lake loses its identity.
If data lakes present such challenges, then why are we talking about them, and worse, actually implementing them? The main reason: because we can.
With today’s data collection and storage technologies, and in particular Hadoop and the Hadoop Distributed File System (HDFS), we now have the ability to collect and retain vast swaths of diverse data sets in their raw, as-is formats, in hopes that someone will find value in them down the road “just-in-time” – where any necessary processing and analytics take place in real time at the moment of need.
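To make the just-in-time idea concrete, here is a minimal sketch – assuming PySpark over HDFS, with hypothetical paths and field names – in which raw JSON events land in the lake untouched and only acquire structure when someone actually queries them:

```python
# A minimal, hypothetical schema-on-read sketch: raw JSON events land in HDFS
# exactly as received, and structure is imposed only at query time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("just-in-time-analytics").getOrCreate()

# Ingest: append the raw records to the lake with no upfront transformation.
raw = spark.read.text("hdfs:///landing/clickstream/2015/06/*.json")
raw.write.mode("append").text("hdfs:///lake/clickstream/raw")

# Query time: infer a schema from the raw records only when it is needed.
events = spark.read.json("hdfs:///lake/clickstream/raw")
events.filter(events["event_type"] == "purchase") \
      .groupBy("product_id") \
      .count() \
      .show()
```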
This era of data abundance is relatively new. Only a handful of years ago, we had no choice but to transform and summarize diverse data sets ahead of time in order to populate our data marts and data warehouses.
Today, in contrast, we can simply store everything, ostensibly without caring about what such data are good for or how they are organized, on the off chance that someone will come along and find a good use for them.
Yet as with the number 47 in the example above, data by themselves may not be useful at all. Simply collecting them without proper care may not only produce large quantities of useless information, but may also take information that once had value and strip that value away.
The Dark Underbelly of Big Data
This dumbing down of the information we collect is the dark underbelly of the Big Data movement. In our mad rush to maximize the quantity of data we can collect and analyze, we risk sacrificing the quality of those data, in hopes that some new analytics engine will magically recover that quality.
We may think of Big Data analytics as analogous to mining for gold, separating the rare bits of precious metal from vast quantities of dross. But we’ll never find our paydirt if we strip away the value during the process of data collection.
The gold metaphor only goes so far, however, as today’s data are far more heterogeneous than a single metal. “Legacy data integration technologies were built for a rows and columns world of structured data,” according to Greg Benson, Chief Scientist at SnapLogic. “Today’s data is often hierarchical (JSON), de-normalized, and evolving.”
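A quick, hypothetical illustration of that contrast (plain Python, invented field names): the nested, denormalized record on top is what a lake typically receives, while the flattened rows below are what a rows-and-columns tool would demand.

```python
# A hypothetical hierarchical, denormalized record of the kind a data lake
# typically receives -- nested objects and arrays rather than rows and columns.
order = {
    "order_id": 47,
    "customer": {"id": "c-981", "region": "EMEA"},
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# A legacy, relational-style view forces the same information into flat rows,
# repeating the order- and customer-level context on every line item.
rows = [
    {
        "order_id": order["order_id"],
        "customer_id": order["customer"]["id"],
        "region": order["customer"]["region"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in order["items"]
]
```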
Perhaps we should go back to the Online Analytical Processing (OLAP) days, when we carefully processed and organized our information ahead of time in order to facilitate subsequent analysis. Even with today’s Big Data technologies, there are reasons to stick with such a “just-in-case” approach to data management, rather than the just-in-time perspective of data lake proponents.
In reality, however, this choice between just-in-case and just-in-time approaches to data management is a false dichotomy. The best approach is a combination of these extremes, favoring one or the other depending on the nature of the data in question and the purposes people intend them to serve.
Moving Up to the Logical Layer of Abstraction
To understand how to approach this decision, it is essential to move up one layer of abstraction. We don’t want to focus our efforts on physical data lakes, but rather logical ones.
With a physical data lake, we can point to the storage technology underlying the lake and rest assured we’ve precisely located our information. With a logical data lake, we can consider multiple data stores – perhaps in one location, or possibly scattered across the cloud, static or in a state of flux – as a single, heterogeneous data lake.
As a result, we only move information when there’s a need to do so, and we only process such information when appropriate, depending upon the goals of the task at hand. Such movement and processing can happen beneath the logical abstraction layer, invisible to the users of analytics tools.
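One way to picture such a logical layer is as a thin catalog that maps dataset names to whichever physical store happens to hold them, fetching and processing data only on request. The sketch below is purely illustrative – the store names and readers are hypothetical, and no real product API is implied:

```python
# A toy logical-data-lake catalog: analytics code asks for a dataset by name,
# and the catalog decides which physical store supplies it and how.
from typing import Any, Callable, Dict


class LogicalDataLake:
    def __init__(self) -> None:
        # dataset name -> function that knows how to fetch it from its store
        self._readers: Dict[str, Callable[[], Any]] = {}

    def register(self, name: str, reader: Callable[[], Any]) -> None:
        self._readers[name] = reader

    def read(self, name: str) -> Any:
        # Data is moved and processed only at the moment it is requested.
        return self._readers[name]()


lake = LogicalDataLake()
# Stand-ins for real backends (HDFS, a cloud object store, a SaaS API, ...).
lake.register("clickstream", lambda: [{"event": "view", "sku": "A-100"}])
lake.register("customers", lambda: [{"id": "c-981", "region": "EMEA"}])

events = lake.read("clickstream")   # one store...
customers = lake.read("customers")  # ...another store, same interface
```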
The key to making the logical data lake work properly – at speed and at scale – is an intelligent data integration layer. This underlying technology must transform data before moving it when appropriate, or afterward if that choice better suits the situation. All metadata – including schemas, policies, and the semantic context of the information – must be appropriately preserved and transformed when necessary.
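As a rough illustration of what preserving and transforming metadata might look like in practice – hypothetical field names, not any particular product’s design – a record can travel through the integration layer wrapped in an envelope that carries its schema and lineage, and every transformation must update both:

```python
# A sketch of a metadata-preserving transformation: the record travels with its
# schema and provenance, and the transform updates data and metadata together.
def to_cents(envelope: dict) -> dict:
    data = dict(envelope["data"])
    meta = {**envelope["metadata"], "schema": dict(envelope["metadata"]["schema"])}

    data["amount"] = int(round(data["amount"] * 100))          # transform the value...
    meta["schema"]["amount"] = "integer (cents)"               # ...and its schema
    meta["lineage"] = meta.get("lineage", []) + ["to_cents"]   # record provenance

    return {"data": data, "metadata": meta}


envelope = {
    "data": {"amount": 12.34, "currency": "USD"},
    "metadata": {
        "schema": {"amount": "decimal (dollars)", "currency": "ISO-4217 code"},
        "source": "orders-api",
    },
}

converted = to_cents(envelope)
```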
A tall order to be sure – one that traditional data integration technologies struggle with. SnapLogic, however, has implemented data integration technology that supports both just-in-case and just-in-time scenarios. “We designed the SnapLogic platform with just-in-time data in mind,” Benson continues, “while supporting traditional relational data if needed.”
SnapLogic is an elastic integration platform as a service (iPaaS) that delivers real-time, event-driven application integration and batch big data integration for analytics in a single cloud platform. Without the support of a next-generation platform like the one SnapLogic delivers, data lakes will be unable to resolve the challenges they face.
SnapLogic is an Intellyx client. At the time of writing, no other organizations mentioned in this article are Intellyx clients. Intellyx retains full editorial control over the content of this article. Image credit: Mark Gregory.
Jason – great article. You took a subject that is drowning in hyperbole and made it real.
Great post. There’s no reason not to approach data from the same Agile perspectives as application development. It’s still common to shove everything into a warehouse or mart, but often it just sits there (on relatively expensive storage media) waiting for use. I believe it’s more useful to think in terms of data sets that can be provisioned on demand, by users themselves, for whatever purpose they need. In other words: quants want big, flat files to load into SPSS, analysts want marts with a well-defined BI layer, and others just want a warehouse to query – these are all transient uses of the same data sets, and they don’t necessarily have to live in their respective analytic end points indefinitely. Furthering your analogy, lakes may collect at different points along a pipeline (perhaps a river!) of data that is enriched or refined at different stages. Often, when combining temporal information, variants and versions of data sets must also be kept.