Data Mesh is not a Data Lake!

Data Mesh is not a Data Lake. Nor is it a Data Lakehouse, or a Data Warehouse.

This may seem obvious, but some people conflate Data Mesh with a certain ‘style’ of Data Lake, for example as written about here, here and here.

Perhaps some of the confusion goes back to the title of Zhamak’s very popular 2019 paper, “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, which includes the term Data Lake right in the title. But if you read the paper and digest the ideas, one of the key failure modes that Zhamak discusses is siloed and hyper-specialized ownership -- when the domain-oriented source teams (eg; apps & LoB operations) are disconnected from the data & ML platform engineers, who are in turn disconnected from the domain-oriented consumers. Thus, any data lake whose data domains (or overall integration design) are disconnected from the domain owners (eg; the operational applications) is not a great example of a Data Mesh! I'm personally always a bit suspicious when I see a story about Data Mesh where the technical architecture starts with, 'and so the raw data is here in our Data Lake...'. Like magic!

In the work that we are doing at Oracle, we explicitly aim to make Data Mesh a solution that is useful for both the domain-ownership side (eg; operations) as well as the domain-consumer side (eg; analytics and data lake).

Take for example real, actual lakes.

If we look at lakes in northern California in isolation, they are these large bodies of water separated by great distances.

Image: northern California lakes (credit: californiasgreatestlakes.com)

But if we look at the wider hydrology of northern California we can see the interconnectedness and the flow of the water throughout the ecosystem.

Image: northern California hydrology (credit: muir-way.com)

The difference between a Data Lake and a Data Mesh is much like that. Whereas a Data Lake is a large body of data in one physical location (eg; object storage in the cloud), a Data Mesh is about the logical and physical interconnectedness of the data from producers through to consumers. In that way, a Data Mesh may include Data Lakes.

Data Products (ie; the data used by consumers, the data that has a particular 'job to be done') can be produced from data sourced from within a Data Lake itself, but also from the data ‘rivers and streams’ that are flowing from the Systems of Record (SoRs) to the Data Lakes. Data doesn't have to drop into a lake in order to become a Data Product.
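To make that last point concrete, here is a minimal Python sketch of a Data Product that is built from events while they are still flowing out of a System of Record. Everything in it is an assumption made for illustration: the order events, the `stream_from_sor` generator and the `revenue_by_region` product are hypothetical names, and a plain Python list stands in for the lake.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Hypothetical change events emitted by a System of Record (for example an order app).
ORDER_EVENTS = [
    {"order_id": 1, "region": "west", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 50.0},
]


def stream_from_sor(events: Iterable[dict]) -> Iterator[dict]:
    """Stand-in for a change stream flowing out of the System of Record."""
    for event in events:
        yield event


def revenue_by_region(stream: Iterable[dict], lake: List[dict]) -> Dict[str, float]:
    """Build a 'revenue by region' data product from events in motion.

    The lake still receives every event, but the product never reads from it.
    """
    totals: Dict[str, float] = defaultdict(float)
    for event in stream:
        totals[event["region"]] += event["amount"]  # product updated from the stream
        lake.append(event)                          # landing in the lake is a side branch
    return dict(totals)


lake_storage: List[dict] = []  # a list standing in for object storage
product = revenue_by_region(stream_from_sor(ORDER_EVENTS), lake_storage)
print(product)  # {'west': 170.0, 'east': 80.0}
```

In this sketch the lake is just one more destination for the stream; the product would be the same if the `lake.append` line were removed.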

Hold up! What about Data Lakehouses?

You may be familiar with the Data Lakehouse concept, popularized by a well-known Databricks blog post from early 2020.

The lakehouse concept takes the usual Data Lake concept and adds a few things, such as: ACID transaction support, schema enforcement, stronger SQL support for analytics, and stream processing. Not everyone agrees this is a particularly innovative concept, since this also sounds a lot like modern data warehouses, but that debate is not the purpose of this post...

As discussed in the Data Mesh stories referenced at the top of this post, some folks are talking about a Data Mesh as being a kind of Data Lake but with (1) well defined data ‘zones’, (2) a catalog of metadata with strong schema typing on the data, (3) a bit of streaming between data inside the lake, and (4) SQL federation tools that may query the data directly within a lake (eg; reporting from data directly in the lake).

But this is not really a Data Mesh… it is a particular style of using a Data Lake.

In fact, even the use of 'data product thinking' does not in and of itself make a Data Mesh -- because the concepts, methodology and best practices of Data Products can be applied to any kind of data architecture (centralized or distributed).

Streaming within a Data Lake is not a Data Mesh, but a Data Mesh should be able to stream within a Data Lake!

You can stream data within a lake (eg; Apache Spark Streaming) but that does not make it a Data Mesh. Without the explicit tie-in to operational data domains (eg; the domain-oriented source teams), the overall Data Lake solution remains siloed: data is merely being tossed over the wall from one team to the next. It takes organizational and technical commitment to join up the data producers to the data consumers, with IT working to provide the overarching tech stack. In fact, most data lakes still operate more or less in isolation from the producers of the data.

In what I consider a great example of a Data Mesh, the folks at Intuit specifically include their Key Stakeholders, Pipelines, and consumption APIs as part of the Data Product definition. The Data Lake is one part of the Data Mesh solution, and not even the most important part.
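As a rough illustration of what it means to put stakeholders, pipelines and consumption APIs inside the Data Product definition, here is a small Python sketch. It is not Intuit's actual schema: the `DataProduct` class, its field names and the example values are all assumptions made for this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataProduct:
    """Hypothetical descriptor; the field names are illustrative assumptions."""
    name: str
    producer_domain: str                                       # operational domain that owns the source data
    stakeholders: List[str] = field(default_factory=list)      # producers and consumers by name
    pipelines: List[str] = field(default_factory=list)         # flows that build the product
    consumption_apis: List[str] = field(default_factory=list)  # how consumers read it

    def missing_parts(self) -> List[str]:
        """Return the names of the definition parts that are still empty."""
        parts = {
            "stakeholders": self.stakeholders,
            "pipelines": self.pipelines,
            "consumption_apis": self.consumption_apis,
        }
        return [part for part, value in parts.items() if not value]


# A table that simply sits in the lake, with none of the surrounding definition.
lake_only = DataProduct(name="raw_orders", producer_domain="orders")

# The same data described together with its people, pipelines and APIs.
mesh_style = DataProduct(
    name="order_revenue",
    producer_domain="orders",
    stakeholders=["orders-app-team", "finance-analytics"],
    pipelines=["orders-change-stream"],
    consumption_apis=["sql-view", "rest-endpoint"],
)

print(lake_only.missing_parts())   # ['stakeholders', 'pipelines', 'consumption_apis']
print(mesh_style.missing_parts())  # []
```

The lake storage location does not appear in the descriptor at all, which echoes the point that the lake is only one part of the solution.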

As a discipline, the Data Lake technical concepts are still vast and important (with or without a Data Mesh). Back to the real-world example of actual lakes, within larger lakes there is an entire ecosystem of ‘zones and currents’ within the lakes themselves:

Limnology (study of zones and ecosystem within a lake), Image credit: Wikipedia

And, there are even 'streams' (currents, and underwater flows) of water within the lakes themselves.

Currents and water flows within Lake Michigan, Image credit: NOAA

Similarly, in a Data Lakehouse there are many technical planning details related to zones (eg; security zones, data domains, and various zones of data curation/quality such as raw, prepared and master data). Likewise, we can create streaming data solutions within the lake to continually flow data around within the boundaries (and interfaces) of the lake.
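The curation zones can be pictured as a series of promotions from one zone to the next. The Python sketch below is only an illustration under stated assumptions: the customer records, the cleaning rule (trim and lowercase the email, drop empty ones) and the "last record wins" rule for master data are all hypothetical choices, not a prescribed lakehouse design.

```python
from typing import Dict, List

# Raw zone: data exactly as it arrived (hypothetical customer records).
raw_zone: List[dict] = [
    {"customer_id": "c1", "email": " Ann@Example.com "},
    {"customer_id": "c2", "email": ""},
    {"customer_id": "c1", "email": "ann@example.com"},
]


def to_prepared(raw: List[dict]) -> List[dict]:
    """Promote raw records to the prepared zone: normalize emails, drop empty ones."""
    prepared = []
    for record in raw:
        email = record["email"].strip().lower()
        if email:
            prepared.append({"customer_id": record["customer_id"], "email": email})
    return prepared


def to_master(prepared: List[dict]) -> Dict[str, dict]:
    """Promote prepared records to the master zone: one record per customer, last one wins."""
    master: Dict[str, dict] = {}
    for record in prepared:
        master[record["customer_id"]] = record
    return master


prepared_zone = to_prepared(raw_zone)
master_zone = to_master(prepared_zone)
print(len(raw_zone), len(prepared_zone), len(master_zone))  # 3 2 1
```

All of this happens inside the boundaries of the lake, which is exactly why it is a lake design concern rather than a Data Mesh.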

But a Data Mesh is something different – it is a solution that can take care of the movement, distribution and management of data outside *or* within a Data Lake.

Alignment of Operational and Analytic Data Domains

Central to our concept of a Data Mesh is the idea that the same technology can be used for data-driven use cases in Operational Data and Analytic Data domains.

For example, use cases for Data Mesh should span both the Operational Data and Analytic Data domains.

This common technology platform makes it much more pragmatic to achieve domain-driven design, where business entities and data elements are more directly connected – in techie terms, we are trying to reduce the impedance mismatch between producers and consumers.

Thus, we can contemplate Data Mesh for a wide range of powerful use cases like:

  • Application migrations to the cloud
  • Modernizing application monoliths to microservices
  • Data availability (eg; distributed data sharding) for monolithic data stores
  • Event sourcing and CQRS for microservices data patterns
  • Real-time integration among Apps, IoT and Analytics
  • Streaming ingest to data lakes and data warehouses
  • Streaming data pipelines inside or outside of data lakes
  • Stream analytics on data-in-motion
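One item in the list above, event sourcing and CQRS, is easier to see in code. The Python sketch below is a toy, in-memory illustration under stated assumptions: `AccountEvent`, `EventStore` and `BalanceView` are hypothetical names, and a real system would use a durable event log rather than a Python list. The write side only appends events; the read side keeps its own view that is updated from those events and can be rebuilt by replaying them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AccountEvent:
    account_id: str
    kind: str    # "deposited" or "withdrawn"
    amount: int


class EventStore:
    """Write side: an append-only log that notifies subscribers of each new event."""

    def __init__(self) -> None:
        self._log: List[AccountEvent] = []
        self._subscribers: List[Callable[[AccountEvent], None]] = []

    def subscribe(self, handler: Callable[[AccountEvent], None]) -> None:
        self._subscribers.append(handler)

    def append(self, event: AccountEvent) -> None:
        self._log.append(event)
        for handler in self._subscribers:
            handler(event)

    def replay(self) -> List[AccountEvent]:
        return list(self._log)


class BalanceView:
    """Read side: a query model derived only from events."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}

    def apply(self, event: AccountEvent) -> None:
        sign = 1 if event.kind == "deposited" else -1
        current = self.balances.get(event.account_id, 0)
        self.balances[event.account_id] = current + sign * event.amount


store = EventStore()
live_view = BalanceView()
store.subscribe(live_view.apply)

store.append(AccountEvent("acct-1", "deposited", 100))
store.append(AccountEvent("acct-1", "withdrawn", 30))

# A second consumer can be built later purely by replaying the log.
rebuilt_view = BalanceView()
for past_event in store.replay():
    rebuilt_view.apply(past_event)

print(live_view.balances)     # {'acct-1': 70}
print(rebuilt_view.balances)  # {'acct-1': 70}
```

The same event log serves the operational writer and any number of downstream readers, which is the kind of shared flow between producers and consumers that this post is describing.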

In this way, we are bringing together and reducing the friction of data that flows among Systems of Record, Systems of Analysis and Systems of Engagement.

We can use the same tech stack to reduce the ‘impedance’ of data processing that occurs between the data producers and the data consumers.

Our objective is to align data consumers to the data while requiring minimal data processing input from IT as a ‘middleman’ in the process.

Data Mesh is not a Data Lake

A Data Mesh can be used for many more use cases than a Data Lake can (eg; in the Operational data domains). And many historical Data Lake designs do not incorporate any principles of a Data Mesh (eg; lacking cohesion with data producers and/or any focus on data product thinking), or for organizational reasons the Data Lake teams remain isolated from the business data producers.

Whereas a Data Lake is conceptually like a real-world lake (with resources collected together in one location), a Data Mesh is more conceptually similar to the hydrology (the movement, distribution and management) of resources in a widely distributed ecosystem.

There are 'modern data lake' solutions out there which are perfectly fine, or even great... but let's try to keep some degree of precision around what is already a complex and confusing topic area! Some data lakes may be part of a data mesh, and some data meshes may orchestrate streaming data within data lakes... but most data lakes are not a data mesh, at all.
