Data Mesh is not a Data Lake!
Data Mesh is not a Data Lake. Nor is it a Data Lakehouse, or a Data Warehouse.
This may seem obvious, but some people conflate Data Mesh with a certain ‘style’ of Data Lake, for example as written about here, here and here.
Perhaps some of the confusion goes back to the title of Zhamak’s very popular 2019 paper, “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, which includes the term Data Lake right in the title. But if you read the paper and digest the ideas, one of the key failure modes that Zhamak discusses is siloed and hyper-specialized ownership: the domain-oriented source teams (e.g., apps and LoB operations) are disconnected from the data and ML platform engineers, who are in turn disconnected from the domain-oriented consumers. Thus, any data lake whose data domains (or overall integration design) are disconnected from the domain owners (e.g., the operational applications) is not a great example of a Data Mesh! I'm personally always a bit suspicious when I see a story about Data Mesh where the technical architecture starts with, 'and so the raw data is here in our Data Lake...'. Like magic!
In the work that we are doing at Oracle, we explicitly aim to make Data Mesh a solution that is useful for both the domain-ownership side (e.g., operations) and the domain-consumer side (e.g., analytics and data lake).
Take for example real, actual lakes.
If we look at lakes in northern California in isolation, they are these large bodies of water separated by great distances.
Image credit: californiasgreatestlakes.com
But if we look at the wider hydrology of northern California we can see the interconnectedness and the flow of the water throughout the ecosystem.
Image credit: muir-way.com
The differences between a Data Lake and a Data Mesh are sort of like that. Whereas a Data Lake is a large body of data in one physical location (e.g., object storage in the cloud), the Data Mesh is about the logical and physical interconnectedness of the data from producers through to consumers. In that way, a Data Mesh may include Data Lakes.
Data Products (i.e., the data used by consumers, data that has a particular 'job to be done') can be produced from data sourced from within a Data Lake itself, but also from the data ‘rivers and streams’ that are flowing from the Systems of Record (SoRs) to the Data Lakes. Data doesn't have to drop into a lake in order to become a Data Product.
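To make that last point concrete, here is a minimal sketch in Python, assuming a hypothetical orders application as the System of Record. The event fields and the function names (order_stream, revenue_by_region, land_in_lake) are invented for illustration and do not come from any particular product. The Data Product is computed straight from the stream, and landing the same events in a lake is just one more consumer.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical change events emitted by a System of Record (an orders app).
ORDER_EVENTS = [
    {"order_id": 1, "region": "west", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 40.0},
]


def order_stream(events: Iterable[dict]) -> Iterator[dict]:
    """Stand-in for a live stream flowing out of the System of Record."""
    for event in events:
        yield event


def revenue_by_region(stream: Iterable[dict]) -> dict:
    """Data Product computed directly from the stream, with no lake read."""
    totals = defaultdict(float)
    for event in stream:
        totals[event["region"]] += event["amount"]
    return dict(totals)


def land_in_lake(stream: Iterable[dict], lake: list) -> None:
    """Optional second consumer: append the raw events to a lake-like store."""
    lake.extend(stream)


# The Data Product is built from the 'river', not from the lake.
product = revenue_by_region(order_stream(ORDER_EVENTS))

# The lake can still receive the same events, independently of the product.
lake = []
land_in_lake(order_stream(ORDER_EVENTS), lake)
```

In a real deployment the list of events would be replaced by whatever streaming technology connects the SoR to its consumers; the point of the sketch is only that the product does not depend on the lake.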
Hold up! What about Data Lakehouses?
You may be familiar with the Data Lakehouse concept, popularized by a well-known Databricks blog from early 2020.
The lakehouse concept takes the usual Data Lake concept and adds a few things, such as: ACID transaction support, schema enforcement, stronger SQL support for analytics, and stream processing. Not everyone agrees this is a particularly innovative concept, since this also sounds a lot like modern data warehouses, but that debate is not the purpose of this post...
As discussed in the reference Data Mesh stories at the top of this post, some folks are talking about a Data Mesh as being a kind of Data Lake but with (1) well-defined data ‘zones’, (2) a catalog of metadata with strong schema typing on the data, (3) a bit of streaming between data inside the lake, and (4) SQL federation tools that may query the data directly within a lake (e.g., reporting from data directly in the lake).
But this is not really a Data Mesh… it is a particular style of using a Data Lake.
In fact, even the use of 'data product thinking' does not in and of itself make a Data Mesh -- because the concepts, methodology and best practices of Data Products can be applied to any kind of data architecture (centralized or distributed).
Streaming within a Data Lake is not a Data Mesh, but a Data Mesh should be able to Stream within a Data Lake!
You can stream data within a lake (e.g., with Apache Spark Streaming), but that does not make it a Data Mesh. Without the explicit tie-in to operational data domains (e.g., the domain-oriented source teams), the overall Data Lake solution remains siloed: data is merely being tossed over the wall from one team to the next. It takes organizational and technical commitment to join up the data producers to the data consumers, with IT working to provide the overarching tech stack. In fact, most data lakes still operate more or less in isolation from the producers of the data.
In what I consider a great example of a Data Mesh, the folks at Intuit specifically include their Key Stakeholders, Pipelines, and consumption APIs as part of the Data Product definition. The Data Lake is one part of the Data Mesh solution, and not even the most important part.
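As a rough illustration of a Data Product definition that carries more than a table in a lake, here is a small Python sketch. The DataProduct class, its field names and the example values are all hypothetical; this is not Intuit's actual schema, just one way to model stakeholders, pipelines and consumption APIs as first-class parts of the definition.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor: the product is more than its storage."""
    name: str
    stakeholders: list = field(default_factory=list)      # owning and consuming teams
    pipelines: list = field(default_factory=list)         # flows that feed the product
    consumption_apis: list = field(default_factory=list)  # how consumers read it

    def missing_parts(self) -> list:
        """Return the names of the parts that have not been defined yet."""
        parts = {
            "stakeholders": self.stakeholders,
            "pipelines": self.pipelines,
            "consumption_apis": self.consumption_apis,
        }
        return [label for label, values in parts.items() if not values]


# A table sitting in a lake, with nothing else defined around it.
lake_only = DataProduct(name="orders_raw_table")

# The same data described with its producers, flows and interfaces.
complete = DataProduct(
    name="orders",
    stakeholders=["order-management team", "finance analytics"],
    pipelines=["orders-app change stream"],
    consumption_apis=["SQL view", "REST endpoint"],
)
```

Calling lake_only.missing_parts() lists everything the lake-only definition leaves out, which is a simple way to see why the lake is only one part of the solution.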
As a discipline, the Data Lake technical concepts are still vast and important (with or without a Data Mesh). Back to the real-world example of actual lakes: larger lakes contain an entire ecosystem of ‘zones and currents’:
Limnology (study of zones and ecosystem within a lake), Image credit: Wikipedia
And, there are even 'streams' (currents, and underwater flows) of water within the lakes themselves.
Currents and water flows within Lake Michigan, Image credit: NOAA
Similarly, in a Data Lakehouse there are many technical planning details related to zones (e.g., security zones, data domains, and various zones of data curation/quality such as raw, prepared and master data). Likewise, we can create streaming data solutions within the lake to continually flow data around within the boundaries (and interfaces) of the lake.
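The curation zones can be pictured with a small Python sketch. The customer records, the field names and the two functions (to_prepared, to_master) are hypothetical, and the rules they apply are just one plausible reading of 'raw, prepared and master': drop unusable records and normalize values on the way to prepared, then keep a single latest record per customer in master.

```python
# Hypothetical records as they land in the raw zone.
RAW_ZONE = [
    {"customer_id": "c1", "email": "A@Example.com", "updated": 1},
    {"customer_id": None, "email": "broken@example.com", "updated": 2},
    {"customer_id": "c1", "email": "a.new@example.com", "updated": 3},
    {"customer_id": "c2", "email": "b@example.com", "updated": 1},
]


def to_prepared(raw: list) -> list:
    """Raw -> prepared: drop records without a key and lowercase emails."""
    return [
        {**record, "email": record["email"].lower()}
        for record in raw
        if record["customer_id"] is not None
    ]


def to_master(prepared: list) -> dict:
    """Prepared -> master: keep only the latest record per customer."""
    latest = {}
    for record in prepared:
        current = latest.get(record["customer_id"])
        if current is None or record["updated"] > current["updated"]:
            latest[record["customer_id"]] = record
    return latest


prepared_zone = to_prepared(RAW_ZONE)
master_zone = to_master(prepared_zone)
```

Everything here happens inside the boundaries of the lake, which is exactly the kind of work the zones are for.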
But a Data Mesh is something different – it is a solution that can take care of the movement, distribution and management of data outside *or* within a Data Lake.
Alignment of Operational and Analytic Data Domains
Central to our concept of a Data Mesh is the idea that the same technology can be used for data-driven use cases in Operational Data and Analytic Data domains.
For example, use cases for Data Mesh should span domains:
This common technology platform makes it much more practical to achieve domain-driven design, where business entities and data elements are more directly connected. In techie terms, we are trying to reduce the impedance mismatch between producers and consumers.
Thus, we can contemplate Data Mesh for a wide range of powerful use cases like:
In this way, we are bringing together and reducing the friction of data that flows among Systems of Record, Systems of Analysis and Systems of Engagement.
We can use the same tech stack to reduce the ‘impedance’ of data processing that occurs between the data producers and the data consumers.
Our objective is to align data consumers to the data while requiring minimal data processing input from IT as a ‘middleman’ in the process.
Data Mesh is not a Data Lake
A Data Mesh can be used for many more use cases than a Data Lake can (e.g., in the Operational data domains). And many historical Data Lake designs do not incorporate any principles of a Data Mesh (e.g., they lack cohesion with data producers and/or any focus on data product thinking), or for organizational reasons the Data Lake teams remain isolated from the business data producers.
Whereas a Data Lake is conceptually like a real-world lake (with resources collected together in one location), a Data Mesh is more conceptually similar to the hydrology (the movement, distribution and management) of resources in a widely distributed ecosystem.
There are 'modern data lake' solutions out there which are perfectly fine, or even great... but let's try to keep some degree of precision around what is already a complex and confusing topic area! Some data lakes may be part of a data mesh, and some data meshes may orchestrate streaming data within data lakes... but most data lakes are not a data mesh, at all.