Data Mesh Simplified
As the data volume generated by the business grows and technology evolves, big-name frameworks spring up as solutions to the problems of previous generations. Data Warehousing is one such, addressing the limitations of reporting directly from multiple source systems. The core of the Data Lake stack, Hadoop, is a horizontally scaling cluster framework that answers the limitations of physical hardware and rigid Data Warehouse requirements while supporting self-service needs, flexible schemas, and diverse data types. The cloud concept, in turn, has grown as a solution to the limitations of isolated clusters and data centers.
As new cloud-related approaches show up, they naturally draw on the cloud take on data warehousing. One such approach is what was known as the Data Fabric, which evolved a few years ago into the Data Mesh.
Yes, Data Mesh is described as a fundamental “revolution” yet again. People who have worked in the IT world long enough have seen quite a few of those so-called “revolutions”, one of the loudest being Hadoop and the Big Data concept. Personally, all I see here is an obvious logical evolution of the data management process.
From the data governance standpoint, the progression is usually presented this way: Centralized and Governed – Data Warehouse; Centralized and Ungoverned – Data Lake; Decentralized and Governed – Data Mesh.
Conceptually, at the technical level there is no big difference between the Data Fabric and the Data Mesh. Both are about a distributed, multi-platform, “flat”, multi-domain-driven approach with a lot of cloud architecture. At the organizational level, however, the difference is significant:
Data Fabric is an approach that concentrates on the technical aspects of a distributed domain-driven architecture with multiple data provisioning/consumption points.
Data Mesh is a concept, a change of organizational philosophy.
Both present themselves as hybrid, multi-cloud, multi-repository, domain-team self-service concepts.
Why is this evolutionary process required? Maybe the EDW/EDL platform combination does not do the job? It does, but a serious problem remains: even properly built EDW/EDL systems struggle to process today’s massive data flows. Properly architected EDWs/EDLs are integrated, cataloged, coordinated, and lined up, but their organizational process cannot handle the permanently growing data flows and the Business demand to use this rapidly arriving new data.
EDW and EDL hit their limits in different ways:
The core problem is that existing EDW/EDL downsides cannot be resolved within the current business model. The following solutions can be considered:
Data Mesh is an attempt to address the challenges listed above, and it uses all five approaches. It changes the environment into multi-cloud hybrid PaaS/SaaS platforms (yes, this requires investment), employs automation, involves more of the non-IT specialized workforce, and functions like a factory assembly line with circular product builds.
Can we call Data Mesh a revolution? No, it is a logical evolution. Like Data Warehouse and Data Lake, the Mesh is not a platform or a piece of software; it is a concept, a set of constantly evolving practices for handling huge data flows and involving more Enterprise professionals in data production.
It is a new approach, and many things are still unclear. The journey resembles what we experienced with Data Warehouses and Data Lakes: both concepts went through the same hard path, full of successes and challenges, and it took each at least a decade to establish the right set of practices and train appropriate personnel.
There is no doubt in my mind that this is a step in the right direction. Extensive automation, cross-enterprise standardization, culture change, and Machine Learning are the future. Decentralized architecture is required to move forward; otherwise we will forever be stuck with the limitations of old technologies.
Now let’s go through the most obvious problems, pitfalls, and hidden cracks on the Data Mesh journey. Keeping these limitations in mind will allow you to adapt your practices and gain more insight.
According to Z. Dehghani, who coined the Data Mesh term and approach, Data Mesh is based upon four core principles:
Let’s have a look at the principles in light of the most obvious challenges.
At the same time we have gap B – duplication of the effort and skills needed to maintain data pipelines and infrastructure in each domain. This is the trade-off.
In the same way, a Data Product in the Data Mesh concept consists of data, metadata, code, and infrastructure dependencies. To consume a Data Product, the following are necessary: an appropriate environment, connections, access rights, and technical skills. Let’s assume that everything is provided by the Enterprise. Are all domain users able to use the ecosystem in the proper way?
For example, to use a data stream from a producing domain, a consumer should be able to work with a stream/messaging service: subscribe to a Kafka topic, or at least read a landed/aggregated stream from a NoSQL or Hadoop environment. This is gap C – the ability to consume a Data Product at the domain level while also producing one.
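To make that consumer-side skill concrete, here is a minimal sketch using the kafka-python client; the topic name, broker address, and consumer group are hypothetical illustrations, not prescribed Data Mesh conventions:

```python
# A minimal sketch of subscribing to a domain's data-product stream.
# Requires: pip install kafka-python. All names are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales-domain.orders.v1",           # hypothetical data-product topic
    bootstrap_servers=["broker:9092"],  # illustrative broker address
    group_id="marketing-domain-consumer",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    order = message.value
    # Downstream domain logic would go here, e.g. aggregation or landing.
    print(order["order_id"], order["amount"])
```

Even this small example assumes the consuming team understands consumer groups, offsets, and deserialization, which is exactly the skill gap in question.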
It is a common belief that if something is called a “Data Product”, then it is easy to use. Nevertheless, a consumable product can mean very different things to a producer and a consumer, especially across tech and business people. Is it a data set to be processed further, or a single column? Is it “column A minus column B” in a dataset, or something like “find the statistical deviation over the latest 7 sets, then use the Monte Carlo method to easily find a local minimum”? Sounds funny? Some financial definitions are much funnier. Who can resolve conflicts about data consumability between domains and enforce the resolution? Do we have enough specialists for that at the Enterprise level? We are talking about data flows at least 10-20 times larger than we have today. Yes, this is gap D – data consumability conflicts.
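One way to reduce that ambiguity is to pin derived metrics down in an explicit, machine-readable contract rather than tribal knowledge. A minimal sketch, where the product, field, and metric names are all hypothetical:

```python
# A minimal sketch of a data-product contract that makes a derived
# metric explicit. All names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    name: str
    expression: str    # the agreed, documented formula
    description: str

@dataclass
class DataProductContract:
    product: str
    owner_domain: str
    metrics: list[MetricDefinition] = field(default_factory=list)

contract = DataProductContract(
    product="net_margin_daily",
    owner_domain="finance",
    metrics=[
        MetricDefinition(
            name="net_margin",
            expression="revenue - cost",  # 'column A minus column B', spelled out
            description="Daily revenue minus daily cost, both in USD.",
        )
    ],
)
```

A contract like this does not resolve the conflict by itself, but it at least gives the conflicting domains a shared artifact to argue about.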
Here a nasty question pops up again: do these domain teams have the appropriate resources to build, run, and govern their self-serve data platform? Who can guide and train them? Who can control the process, orchestrate multi-team efforts, and resolve conflicts? How do decentralized teams impose cross-organization standards? Data Mesh states that governance needs to align itself to the overarching corporate data strategy and have the authority to enforce that strategy at all levels. Is that cool? Yes, but it is very hard to achieve. This is gap E – orchestration between the domains. The standardization part we can refer back to gap D.
Data management at the domain level versus Enterprise-wide consumption will surely bring even more struggle with data integrity. Yes, this is one of the most serious challenges of the Data Mesh – appropriate data integration. This is gap F.
There is an even more global data governance question – appropriate data cataloging, which is not a trivial problem; it is gap G. Data governance tools and catalogs have struggled with this since their appearance in the early 2000s, and there is nothing fundamentally unique in Data Mesh to make that job more manageable. It makes sense to repeat: we are talking about data flows 10-20 times (potentially over 100 times) larger than we have now, and most likely the existing data catalogs cannot handle this challenge.
This looks like a pile of tough problems where it is hard to tell which end to start from. But if we summarize the challenges from the points above and group them logically, we can see that they split into two big groups:
It already does not look so bad, right?
For the first group the solution idea looks quite simple – train the domain teams, help them, and make their tasks easier to work with:
-------------------------------------------
Pilot use cases
As a start, it makes sense to identify one or two use cases that can enable domain-driven ownership and create business-domain-based teams, staffed with data specialists, which carefully scope and execute the necessary environmental transformations to deliver their data as a product. Then use that concrete experience to cross-train a few more teams and instill the data domain expertise, and expand the process until the new approach is established everywhere. Applying the same continuous improvement/testing-and-validation approach that we use in the DevOps concept looks very helpful here.
Automate the data engineering tasks
It is critically important to automate data engineering tasks wherever possible and embrace IaaS/PaaS to simplify tactical operations. Here an experts’ desk should play a key role in Enterprise education and in creating and enforcing the standards. The result is an architecture and a set of data services that provide consistent capabilities across various endpoints in a hybrid, multi-cloud environment.
Consistent data product endpoints will still look different and be accessed in different ways for structured, unstructured, and semi-structured data, for data streams, and for distributed Enterprise Data platforms.
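A thin access layer can hide some of that variety behind one entry point. A minimal sketch, assuming hypothetical reader functions for each endpoint type:

```python
# A minimal sketch of one access facade over differently shaped
# data-product endpoints. Endpoint types and reader stubs are illustrative.
from typing import Any, Callable

def read_table(uri: str) -> Any:
    ...  # e.g. query a warehouse table (structured data)

def read_stream(uri: str) -> Any:
    ...  # e.g. subscribe to a topic (streaming data)

def read_files(uri: str) -> Any:
    ...  # e.g. scan object storage (semi-/unstructured data)

READERS: dict[str, Callable[[str], Any]] = {
    "table": read_table,
    "stream": read_stream,
    "files": read_files,
}

def open_data_product(endpoint_type: str, uri: str) -> Any:
    """Dispatch to the right reader so consumers see one entry point."""
    try:
        return READERS[endpoint_type](uri)
    except KeyError:
        raise ValueError(f"Unsupported endpoint type: {endpoint_type}")
```

The facade itself is trivial; the real work is standardizing what each reader must guarantee behind it.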
Define Shared Structure
For the distributed DW, the best and easiest way appears to be the good old, dependable Kimball DW, run separately by multiple domain teams. Trying to build such a distributed DW/Lakehouse using Inmon’s 3NF approach does not look robust for this kind of multi-team coordination. By the way, we do have an Inmon-type EDW in the organization; it is a highly centralized DW, though it can still serve as one of the most important domains/nodes. Using a central platform team for a distributed DW will obviously become a bottleneck.
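For readers less familiar with the Kimball style, here is a minimal star schema sketch: one fact table surrounded by dimensions, which is what each domain team would own. Table and column names are illustrative; the snippet runs against SQLite:

```python
# A minimal Kimball-style star schema: a fact table plus conformed
# dimensions. All names are illustrative examples only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,   -- e.g. 20240131
    full_date  TEXT,
    month_name TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount_usd  REAL
);
""")
conn.close()
```

The appeal for Data Mesh is that each domain can build such a star independently, as long as the shared dimensions stay conformed across domains.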
Proper Data Governance
Proper Data Governance becomes a critical point in Data Mesh practice: a centrally governed topology, taxonomy and catalog, technology and data patterns (policies), universal IDs, and shared terminology and definitions will play the same role in the Data Mesh concept that “conformed definitions and dimensions” play in the DW concept.
Thus, there should be a dedicated team, perhaps governed by the CDO. There should also be a kind of certification stating whether a provided data set is consumable, secure, of appropriate quality, etc. As Data Mesh suggests, a data strategy requires top-down buy-in and bottom-up ownership.
A successful implementation of the Data Mesh philosophy should accommodate diverse needs through a balanced list of standards: technology blueprint approval, certified/not-certified physical or logical areas (some objects tagged as certified), and so on; a sketch of such a certification check follows below.
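Here is a minimal sketch of what an automated certification gate might look like; the required tags and thresholds are my illustrative assumptions, not a standard Data Mesh specification:

```python
# A minimal sketch of a data-product certification check.
# Required fields and thresholds are illustrative assumptions.
REQUIRED_TAGS = {"owner", "classification", "quality_score", "sla"}

def certify(metadata: dict) -> tuple[bool, list[str]]:
    """Return (certified, reasons) for a data set's governance metadata."""
    reasons = []
    missing = REQUIRED_TAGS - metadata.keys()
    if missing:
        reasons.append(f"missing tags: {sorted(missing)}")
    if metadata.get("quality_score", 0.0) < 0.95:  # illustrative threshold
        reasons.append("quality score below 0.95")
    if metadata.get("classification") == "restricted" and not metadata.get("access_policy"):
        reasons.append("restricted data needs an access policy")
    return (not reasons, reasons)

ok, why = certify({"owner": "finance", "classification": "internal",
                   "quality_score": 0.97, "sla": "daily"})
print(ok, why)  # True, []
```

Encoding the rules this way lets the central team own the policy while domain teams run the check themselves.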
Ecosystem Governance
Ecosystem Governance: to ensure business owners can trust and share their data products, an enterprise data governance team should implement access controls, cataloging, and compliance policies across the distributed domains. This team examines each point in the creation of a data product – can the data be trusted, did the data owners apply the right constraints on usage, etc. – with appropriate tagging and an action plan. They also need to maintain a common glossary to minimize the ever-present risk of language barriers between business units, and to intelligently distribute data products between data domains.
To enable cross-domain collaboration, the Data Mesh must standardize formatting, governance, discoverability, and metadata fields, among other data features. With a central governance team this task is well known to the organization: separation of duties, qualified management, a professional team, etc. But how do we manage such tasks with dozens of teams that have very different visions and areas of expertise? How do we assign responsibilities to avoid work duplication and orphaned tasks? How do we resolve conflicts?
Distributed/Democratic vs Centralized/Dictated
The core idea of distributed systems is a democratic approach, in contrast to traditional army-style centralized structures. The idea of Data Mesh is to keep decision making as local (democratic) as possible. A domain team ingests operational data and builds analytical data models to perform its analysis, then uses that analytical data to create data products based on other domains’ needs. Each data domain must define and agree on the SLAs and quality measures it will “guarantee” to its consumers. An individual Data Product works much like an individual microservice, but it needs an appropriate, standardized interface; otherwise effective cross-domain interaction becomes hardly possible.
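A minimal sketch of such a standardized interface, with hypothetical SLA fields and method names, might look like this:

```python
# A minimal sketch of a standardized Data Product interface, analogous
# to a microservice contract. Field and method names are illustrative.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterator

@dataclass(frozen=True)
class SLA:
    freshness_hours: int      # max age of data at the endpoint
    availability_pct: float   # e.g. 99.5
    completeness_pct: float   # e.g. 99.0

class DataProduct(ABC):
    """Every domain's product exposes the same minimal surface."""

    @abstractmethod
    def schema(self) -> dict[str, str]:
        """Column name -> type, as published in the catalog."""

    @abstractmethod
    def sla(self) -> SLA:
        """The guarantees this domain commits to its consumers."""

    @abstractmethod
    def read(self) -> Iterator[dict[str, Any]]:
        """Stream records through the standardized endpoint."""
```

The point is not this particular shape but that every domain implements the same surface, so consumers never need domain-specific plumbing.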
How can a distributed system (not only the Mesh) work as a synergetic whole rather than a collection of separate domains that frequently conflict with each other?
It makes sense to suggest three steps, the way successful democratic structures (like countries) work, where each next step is taken only if the previous one does not work properly.
Automation, automation, automation
A very important aid for such a distributed system, one that should help a lot with the tasks above, is AI combined with collective decision-making systems. These should automate task assignment, tracking, linking, searching for analogies, calls for voting, etc. This is not software or a platform that makes decisions for a human; it is a system that helps humans collaborate with the AI and with each other through an AI interface. The result is not just the simple sum of the human skills in a team and an AI’s capability; it is an integral structure with a very different, more collaborative, and more capable culture. We are not there yet, but just a couple of years ago we could not guess what Generative AI would be capable of, and this is the future.
We should keep in mind that Data Mesh is meant to bring fewer hops and simpler data processing than the traditional approach. The Mesh should decrease data bureaucracy, not increase it.
Properly using AI/ML approaches and automation tools is critical for this new concept in multiple areas: finding gaps in data integrity, checking data quality, tracing lineage, discovering new rules and patterns, etc. These tools target both the local and the Enterprise levels; a minimal sketch of one such automated check follows.
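As one small example of the kind of check worth automating, here is a sketch that flags columns whose null rate drifts above a historical baseline; the thresholds and column names are illustrative assumptions:

```python
# A minimal sketch of an automated data-quality check: flag columns
# whose null rate drifts above a historical baseline.
def null_rates(rows: list[dict]) -> dict[str, float]:
    """Fraction of missing values per column."""
    if not rows:
        return {}
    cols = rows[0].keys()
    return {c: sum(r.get(c) is None for r in rows) / len(rows) for c in cols}

def find_quality_gaps(rows: list[dict],
                      baseline: dict[str, float],
                      tolerance: float = 0.05) -> list[str]:
    """Columns whose null rate exceeds baseline by more than tolerance."""
    current = null_rates(rows)
    return [c for c, rate in current.items()
            if rate > baseline.get(c, 0.0) + tolerance]

batch = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
print(find_quality_gaps(batch, baseline={"id": 0.0, "amount": 0.1}))
# ['amount'] -> null rate 0.5 drifted above baseline 0.1 + tolerance 0.05
```

Rule-based checks like this are the floor; the ML layer the Mesh needs would learn the baselines and patterns instead of hard-coding them.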
The modern Enterprise ecosystem planned to live under Mesh domains includes all types of application DBs, Lakes (Hadoop- or S3-based), on-premises and distributed DWs, and Lakehouses, plus some shared file system areas (NAS or its cloud substitute).
To address the problem of duplicated effort, the Data Mesh gathers the domain-agnostic data infrastructure capabilities into a single platform that handles the core processes for monitoring and managing our data sets regardless of where they reside (a sketch of a product descriptor combining several of these follows the list):
· Data product schema
· Data product versioning
· Data discovery, cataloging, and product publishing
· Data governance and standardization
· Data lineage
· Data product quality metrics
· Data product monitoring, alerting, and logging
· Data pipeline engines
· Encryption for data at rest and in motion
· Data storage
· Data ingestion architecture
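Here is a minimal sketch of how several of these capabilities (schema, versioning, lineage, quality metrics, publishing) might meet in a single platform-level product descriptor; every name and value is a hypothetical illustration:

```python
# A minimal sketch of a platform-level data-product descriptor that
# unifies schema, versioning, lineage, and quality metrics in one
# record, regardless of where the data physically resides.
from dataclasses import dataclass, field

@dataclass
class ProductDescriptor:
    name: str
    version: str                                              # versioning
    schema: dict[str, str]                                    # column -> type
    upstream: list[str] = field(default_factory=list)         # lineage
    quality: dict[str, float] = field(default_factory=dict)   # quality metrics
    storage_uri: str = ""                                     # where data resides

catalog: dict[str, ProductDescriptor] = {}

def publish(p: ProductDescriptor) -> None:
    """Register the product so discovery/cataloging can find it."""
    catalog[f"{p.name}@{p.version}"] = p

publish(ProductDescriptor(
    name="orders_curated",
    version="2.1.0",
    schema={"order_id": "string", "amount_usd": "double"},
    upstream=["sales_raw@1.4.0"],
    quality={"completeness_pct": 99.2},
    storage_uri="s3://lake/orders_curated/",
))
```

A real platform would back this with a persistent catalog service, but the shape of the record is the point: one descriptor per product, shared by every domain.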
Thus, the Data Mesh architecture standardizes data management practices across the Enterprise platforms. It is supposed to provide Enterprise-level data visibility and insights, data access and control, and data protection and security. That is what the Data Mesh concept was originally developed for. But it is not easy; in fact, Data Mesh is the most complicated and demanding data architecture of them all.