The Role of Data Lake in a Data Mesh
In my experience, quite a few people still see the Data Lake or Lakehouse as a concept opposed to the Data Mesh. Much of this probably stems from one of the earlier publications on the Data Mesh, which rightfully depicts the monolithic design enforced by most data lake implementations as flawed and at odds with the Data Mesh approach. But that criticism is aimed at the chosen topology, not at the technology or architecture itself.
Here, I would like to highlight how the benefits of a data lake can be critical to a data mesh architecture, and that those benefits have never changed. On the contrary, we can see a trend in which the role of the Data Lake becomes even more critical to a mesh architecture.
Data Increases in Value When Being Shared
The Data Mesh proposes a federated and governed domain-oriented data ownership where data is shared and consumed as products. The consumption through "data sharing" is one of the key aspects here.
Data as a product (1):
Any data sharing activity generally starts with combining data from various fragmented resources in one repository, which a company can then use to complete analyses and establish platforms. A repository can be anything in the range of a data warehouse, data lake, or file system. Data warehouses are repositories of structured data sets. The data has already been selected from different sources, cleaned, and integrated into a predefined structure. Data lakes are repositories of unstructured data combined without the initial cleaning step; a company can structure the data as needed for specific applications.
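The difference between the two repository types described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `WAREHOUSE_COLUMNS` tuple, and both function names are hypothetical and do not come from any particular product. The warehouse-style loader enforces a predefined structure when data is written, while the lake-style reader keeps raw records and applies structure only when a consumer asks for specific fields.

```python
import json

# Hypothetical raw records as they might land in a lake: uncleaned, mixed shapes.
RAW_LAKE_RECORDS = [
    '{"order_id": 1, "amount": "19.90", "country": "DE"}',
    '{"order_id": 2, "amount": 5, "note": "gift"}',
    'not valid json',
]

# Hypothetical predefined structure of a warehouse table.
WAREHOUSE_COLUMNS = ("order_id", "amount", "country")


def _parse(line):
    """Return the record as a dict, or None if the line cannot be used."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def load_into_warehouse(lines):
    """Warehouse style: clean and fit data into the structure at load time."""
    rows = []
    for line in lines:
        record = _parse(line)
        # Records missing any predefined column are rejected up front.
        if record is None or any(c not in record for c in WAREHOUSE_COLUMNS):
            continue
        rows.append(
            (int(record["order_id"]), float(record["amount"]), str(record["country"]))
        )
    return rows


def read_from_lake(lines, fields):
    """Lake style: keep raw data, apply the structure a consumer needs at read time."""
    result = []
    for line in lines:
        record = _parse(line)
        if record is not None:
            # Fields a record does not carry are returned as None.
            result.append({f: record.get(f) for f in fields})
    return result
```

In this sketch the second record never reaches the warehouse table because it lacks a `country`, yet a consumer who only cares about `order_id` and `note` can still read it from the lake.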
The data lake as nodes on the mesh (2):
At this point, it's worth mentioning that the Lakehouse aims to enhance the data lake as a structured data repository, making it unnecessary to have a mix of repositories (at one node) like a data warehouse and a data lake.
It should be clear now that data lakes and data mesh are not opposing concepts and that the data lake can be a critical component of any data mesh implementation. However, we haven't discussed what makes a data lake essential in a data mesh architecture.
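One way to picture data lakes as nodes on a mesh is the following minimal Python sketch. All class names, the `lake://` style locations, and the `domain.product` key format are hypothetical, chosen only for illustration. The idea shown is that each domain owns its own lake and publishes data products from it, while a shared catalog only records where those products live and who owns them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner_domain: str
    location: str  # hypothetical address inside the owning domain's lake


@dataclass
class DomainLake:
    """A lake owned by a single domain: one node on the mesh."""
    domain: str
    products: dict = field(default_factory=dict)

    def publish(self, name, location):
        # The product is always stamped with the domain that owns this lake.
        product = DataProduct(name, self.domain, location)
        self.products[name] = product
        return product


class MeshCatalog:
    """Federated discovery: stores references to products, never the data itself."""

    def __init__(self):
        self._entries = {}

    def register(self, product):
        key = f"{product.owner_domain}.{product.name}"
        if key in self._entries:
            raise ValueError(f"data product already registered: {key}")
        self._entries[key] = product
        return key

    def discover(self, key):
        return self._entries[key]

    def domains(self):
        return sorted({p.owner_domain for p in self._entries.values()})
```

A consumer in one domain would look up, say, `sales.orders` in the catalog and then read from the sales lake directly, so no central monolithic lake is involved.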
The Benefits of a Data Lake
The benefits of a data lake have largely stayed the same since the days of the monolithic design. I would summarize them as follows:
The last point above will likely make data lakes even more critical to future data mesh architectures.
Data Lakes Might Even Matter More
In a recent interview with Datanami, Matei Zaharia, CTO of Databricks, reveals that they see a double-digit percentage of workloads using streaming. Databricks considers this a trend in which enterprises want to build operational applications with their data. It is certainly not every company, but what drives this trend are applications in which actions on incoming data are operationalized.
Under the assumption of such a trend, data lakes will likely play a more significant role in a data mesh architecture.
Technical Architect - D & A, Scrum Certified CSM | 14x Databricks | 7x Azure Certified | 6x IBM Certified | 5x Sisense Certified | 6x Tableau Certified
2y
To me, Data Mesh is a coined title, and the back-end concept is similar to data marts with conformed dimensions that cater to the business requirements of different entities in the organization, with the added flavor of advanced technology and the flexibility to cover major use cases in the data landscape. I believe data strategy is key to Data Mesh success.
It may be worth mentioning that the Data Lake was originally meant to store data from one source only, not to be a monolith where all the data from all the sources are stored. And if we keep that in mind then it is evident that a Data Lake actually fits very well into Data Mesh.
Data, Analytics & AI @ Creative Data | EU Funding | Pluralsight Author
2y
I wouldn't consider the Data Lake a concept but rather a storage type and an alternative to a relational database; that's why I also don't see any contradiction in this case. However, I do see a contradiction when it comes to a data warehouse/lakehouse and a "data mesh". The former concept is centralized and subject-oriented (ensuring consistency across data), whereas the latter is decentralized and domain-oriented.