Data Mesh Simplified

As the volume of data generated by the business grows and technology evolves, big-name frameworks spring up as solutions to the problems of the previous generation. Data Warehousing is one of them, addressing the limitations of reporting directly from multiple source systems. Hadoop, the core of the Data Lake stack, is a horizontally scaling cluster platform that answers the limitations of physical hardware and rigid Data Warehouse requirements while supporting self-service, flexible schemas, and varied data types. The cloud, in turn, has grown as a solution to the limitations of isolated clusters and data centers.

As new cloud-related approaches show up, they naturally build on cloud-based data warehousing. One such approach was known as the Data Fabric, which evolved a few years ago into the Data Mesh.

Yes, Data Mesh is described as a fundamental “revolution” again. People who have worked in the IT world long enough have seen quite a few of those so-called “revolutions”, one of the loudest being Hadoop and the Big Data concept. Personally, all I see here is an obvious, logical evolution of the data management process.

From the data governance standpoint, the progression is usually presented this way: Centralized Governed – Data Warehouse, Centralized Ungoverned – Data Lake, and Decentralized Governed – Data Mesh.

Conceptually, at the technical level there is no big difference between the Data Fabric and the Data Mesh. Both are about a distributed, multi-platform, “flat”, multi-domain-driven approach with a lot of Cloud architecture. But conceptually, at the organizational level, the difference is significant:

Data Fabric is an approach that concentrates on the technical aspects of a distributed domain-driven architecture with multiple data provisioning/consumption points.

Data Mesh is a concept, a change of organizational philosophy.

Both aspire to be hybrid, multi-cloud, multi-repository, self-serve concepts for domain teams.

Why is this evolutionary process required? Does the EDW/EDL platform combination no longer do the job? It does, but a serious problem remains. Even properly built EDW/EDL systems struggle to process today’s massive data flows. Well-architected EDWs/EDLs are integrated, cataloged, coordinated, and aligned, but their organizational process cannot keep up with constantly growing data volumes and the demand from the Business to use this rapidly growing new data.

EDW and EDL hit their limits in different ways:

  1. The EDW is excellent for data consumption but poor at taking in new data and updating models. A simple request from a domain team takes anywhere from several weeks to many months (!) to complete. It becomes a frustrating experience for business teams who must wait for the data to be brought into the EDW.
  2. The EDW is verified and reliable, and the users cannot change a thing. Once again, a request to change something takes months. As a result, business teams build their own data stores so they can extract the data from the EDW and then make changes faster. Doesn’t it sound weird?!
  3. The EDL is excellent at data ingestion but horrible for self-serve data consumption. Users who do not come from a data science background must be trained to work in the Hadoop environment and need extra tools (edge nodes and/or data access tools, knowledge of Spark or something similar, etc.) to connect and retrieve data. The truth is that Hadoop is really slow and expensive. For business users, consuming data from the EDL is a frustrating experience.

The core problem is that existing EDW/EDL downsides cannot be resolved within the current business model. The following solutions can be considered:

  • Add more workforce to the EDW team: increase it 10-15 times by adding more Data Analysts, Developers, Modellers, and Admins. A similar approach applies to the EDL concept, where we buy more expensive tools and hire more Hadoop developers. Probably not the most realistic or sustainable approach.
  • Delegate specialized technical tasks to professional services: change the tools to more PaaS or even SaaS offerings that are managed by a provider and thus require less personnel. It is like outsourcing the maintenance of a big building to professional electricians, carpenters, and painters rather than relying on one individual to do it all. It is a much better approach for succeeding at a larger scale.
  • Change the business model: reach a higher level of efficiency by recruiting a non-IT specialized workforce. This approach will require that appropriate training and tooling is provided.
  • Trade immediate quality for speed to create new opportunities: in the modern world it is much more important to have an imperfect result right now than a brilliant one in two years. Data Mesh does for data processing what DevOps has done for application development – it makes it more flexible and agile. DevOps/DataOps/Data Mesh versus the classical Waterfall software approach is like a factory assembly line versus a craftsman. Side note: this is where the Data Factory name came from.
  • Replace human-driven operations with extensive process automation: the productivity of a craftsman or artisan cannot compare with an efficient assembly line.

Data Mesh is an attempt to address the challenges listed above, and it uses all five approaches. It changes the environment into multi-cloud hybrid PaaS/SaaS platforms (yes, it will require investment), it employs automation, it involves more of a non-IT specialized workforce, and it functions as a factory assembly line with circular product builds.

Can we call Data Mesh a revolution? No, it is a logical evolution. Like the Data Warehouse and the Data Lake, the Mesh is not a platform or a piece of software; it is a concept, a set of constantly evolving practices to handle huge data flows and involve more Enterprise professionals in producing data.

It is a new approach, and many things are still unclear. The journey is similar to what we experienced with Data Warehouses and Lakes. Both concepts went through the same hard path, full of successes and challenges. It took at least a decade for each to establish the right set of practices and train the appropriate personnel.

There is no doubt in my mind that this is a step in the right direction. Extensive automation, cross-enterprise standardization, changing the culture, and using Machine Learning are the future. A decentralized architecture is required to move forward; otherwise we will forever be stuck with the limitations of old technologies.

Now let’s go through the most obvious problems, pitfalls, and hidden cracks on the Data Mesh journey. Keeping these limitations in mind will allow you to adapt your practices and create more insight.

According to Z. Dehghani, who coined the Data Mesh term and first formulated the approach, Data Mesh is based upon four core principles:

  1. Domain oriented, decentralized data ownership and architecture
  2. Data as a Product
  3. Self-serve data infrastructure as a Platform
  4. Federated computational governance

Let’s have a look at the principles together with the most obvious challenges related to them.

  1. Domain-oriented decentralized data ownership and architecture. Sounds intriguing, but in reality, how many domains are big enough and have a professional team to own their architecture? Usually, a large organization has at least 5 very qualified Data Analysts in the core Data and Analytics team to support the DW and related operations. Let’s assume a single domain requires 3 qualified DAs. Going for the Mesh concept requires dozens of Data Domains; otherwise they become large business units and inherit the same problems as any monolithic architecture. Let’s say an organization has 30 domains; it then needs 90 more DAs at the domain level. The situation is the same with other data professionals – ETL developers, domain administrators, etc. Obviously, the domains will not have those resources, at least in the near future. Even going for 20 domains will not change the situation. Thus, we have gap A here – lack of professional resources.

At the same time we get gap B – duplication of the effort and skills needed to maintain data pipelines and infrastructure in each domain. It is the price we pay.

  2. Data as a product. Put simply, it means that the data is always ready to be consumed out of the box – for example, a GPS or Google Maps service, or a computer game we buy on a CD: the data is consumed immediately “as is”. An important point: to use a product we need the appropriate environment – a smartphone/computer with the right software and system settings, and a minimum skill set (e.g., the ability to read a map).

In the same way, a Data Product in the Data Mesh concept consists of data, metadata, code, and infrastructure dependencies. To consume a Data Product you need the appropriate environment, connections, access rights, and technical skills. Let’s assume everything is provided by the Enterprise. Are all domain users able to use the ecosystem properly?
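
For illustration only, here is a minimal Python sketch of the kind of descriptor a domain might publish to bundle those four parts together; all field names and values are hypothetical and not part of any Data Mesh standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    """Hypothetical descriptor bundling the four parts of a data product."""
    name: str                 # e.g. "sales.orders_daily"
    owner_domain: str         # the domain team accountable for the product
    data_location: str        # where the data itself lives (table, bucket, or topic)
    metadata: dict = field(default_factory=dict)   # schema, quality metrics, SLA, lineage
    code: list = field(default_factory=list)        # pipelines/notebooks that build the product
    infrastructure: list = field(default_factory=list)  # platforms the product depends on

product = DataProductDescriptor(
    name="sales.orders_daily",
    owner_domain="sales",
    data_location="s3://warehouse/sales/orders_daily/",
    metadata={"format": "parquet", "freshness_sla_hours": 24},
    code=["git://repos/sales-pipelines/orders_daily.py"],
    infrastructure=["spark-cluster-prod", "s3://warehouse"],
)
print(product.owner_domain, product.metadata["freshness_sla_hours"])
```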

For example, to use a data stream from a producing domain, a consumer should be able to work with a stream/messaging service – subscribe to a Kafka topic, or at least read a landed/aggregated stream from a NoSQL or Hadoop environment. That is gap C – the ability to consume a Data Product at the Domain level while producing products at the same time.
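
To make that skill requirement concrete, here is a minimal sketch of the “simple” case – subscribing to a topic with the confluent-kafka Python client. The broker address, consumer group, and topic name are placeholders, not references to any real environment:

```python
from confluent_kafka import Consumer

# Placeholder connection settings; in practice these would come from the platform team
consumer = Consumer({
    "bootstrap.servers": "broker.example.internal:9092",
    "group.id": "marketing-domain-consumers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sales.orders_events"])  # hypothetical data product topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # Even this step assumes the consumer knows the topic's serialization format
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()
```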

It is a common belief that if something is called a “Data Product” then it is easy to use. Nevertheless, a consumable product can mean very different things to a producer and a consumer, especially between tech and business people. Is it a data set to be processed further, or one column? Is it “column A minus column B” in a dataset, or something like “Find the statistical deviation across the latest 7 sets, then use the Monte Carlo method to easily find a local minimum”? Sounds funny? Some financial definitions are much funnier. Who can resolve conflicts about data consumability between domains and enforce the resolution? Do we have enough specialists at the Enterprise level for that? We are talking about data flows at least 10-20 times larger than we have today. Yes, it is gap D – data consumability conflicts.

  3. Self-serve data platform. For domain teams to create and consume decentralized data products autonomously, using platform abstractions that satisfy Enterprise-level requirements, is no trivial feat.

Here a nasty question pops up again – do these domain teams have the appropriate resources to build, run, and govern their self-serve data platform? Who can guide and train them? Who can control the process, orchestrate multi-team efforts, and resolve conflicts? How do decentralized teams enforce cross-organization standards? Data Mesh states that governance needs to align itself with the overarching corporate data strategy and have the authority to enforce that strategy at all levels. Is it cool? Yes, it is, but it is very hard to achieve. This is gap E – orchestration between the domains. The standardization part we can attribute to gap D.

Data management at the domain level versus consumption at the Enterprise level will certainly bring even more struggle with data integrity. Yes, appropriate data integration is one of the most serious challenges of the Data Mesh. This is gap F.

  4. Federated Computational Governance. Data Mesh provides very little guidance on how this should work in the real world. Usually it is described like this: data governance standards are defined centrally, but local domain teams have the autonomy and resources to execute those standards. OK, let’s assume we have already closed the gap with domain resources.

There is a more global data governance question – appropriate data cataloging, which is not a trivial problem; it is gap G. Data governance tools and catalogs have struggled with this since they appeared in the early 2000s, and there is nothing fundamentally unique in Data Mesh that makes the job more manageable. It makes sense to repeat: we are talking about data flows 10-20 times (potentially over 100 times) larger than we have now, and most likely existing data catalogs cannot handle this challenge.

It looks like a lot of tough problems, and it is hard to tell which end to start from. But if we summarize the challenges above and group them logically, they split into two big groups:

  1. Shortage of personnel and lack of domain team education in data processing, governance, and modeling (gaps A, B, C, and D).
  2. Lack of Enterprise Data Governance and cross-domain interaction, standardization, cataloging, and data integration at the Enterprise level (gaps E, F, and G).

That already does not look so bad, right?

For the first group the solution idea looks quite simple – train the domain teams, help them, and make their tasks easier:

  1. Prepare the core domain team(s)/council(s) with a clear set of roles and responsibilities to educate the local teams in data modeling/processing and proper governance.
  2. Create a central dedicated support desk with a highly qualified team in Data Warehousing, Big Data processing, and related areas.
  3. Establish data modeling/processing and governance certification, similar to an established automated certification ticketing system (something like Athena).
  4. Make the data management tasks easier through simplification, standardization, and automation.
  5. Start with small pilot projects/domains to gather real experience and establish integration/interaction practices.

-------------------------------------------

Pilot use cases

As a start it makes sense to identify one or two use cases that can enable domain-driven ownership and to create the business-domain-based teams, staffed with data specialists, which carefully scope and execute the necessary environmental transformations to deliver their data as a product. Then use this concrete experience to cross-train a few more teams and instill the data domain expertise. Then expand the process until the new approach is established everywhere. Applying the same continuous improvement/testing and validation approach that we use in the DevOps concept looks very helpful here.

Automate the data engineering tasks

It is critically important to automate the data engineering tasks wherever possible and embrace IaaS/PaaS to simplify the tactical operations. In this case the expert desk should play a key role in Enterprise education and in creating and enforcing the standards. It is an architecture and set of data services that should provide consistent capabilities across various endpoints in a hybrid, multi-cloud environment.

Consistent data product endpoints will look and be accessed in different ways for structured, unstructured, and semi-structured data, for data streams, and for distributed Enterprise Data platforms.
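
As one possible illustration of “consistent capabilities across different endpoints”, here is a minimal Python sketch of a common read interface over two endpoint types; the class and method names are invented for this example, not an existing API:

```python
from typing import Iterable, Protocol

class DataProductEndpoint(Protocol):
    """Hypothetical common interface over heterogeneous data product endpoints."""
    def describe(self) -> dict: ...
    def read(self) -> Iterable[dict]: ...

class WarehouseTableEndpoint:
    """Structured data served from a warehouse table."""
    def __init__(self, table: str) -> None:
        self.table = table
    def describe(self) -> dict:
        return {"type": "structured", "table": self.table}
    def read(self) -> Iterable[dict]:
        return iter([])  # a real implementation would issue SQL here

class StreamTopicEndpoint:
    """Semi-structured events served from a message stream."""
    def __init__(self, topic: str) -> None:
        self.topic = topic
    def describe(self) -> dict:
        return {"type": "stream", "topic": self.topic}
    def read(self) -> Iterable[dict]:
        return iter([])  # a real implementation would consume from a broker here

def profile(endpoint: DataProductEndpoint) -> None:
    """Consumers call the same interface regardless of the endpoint type."""
    print(endpoint.describe())

profile(WarehouseTableEndpoint("sales.orders_daily"))
profile(StreamTopicEndpoint("sales.orders_events"))
```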

Define Shared Structure

For the distributed DW, the best and easiest way appears to be the good old, dependable Kimball DW, run separately by multiple domain teams. Trying to create such a DDW/Lakehouse using Inmon’s 3NF approach does not look robust for this kind of multi-team coordination. By the way, we have an Inmon-type EDW in the organization; it is a highly centralized DW, though it can still be one of the most important domains/nodes. Using a central platform team for a Distributed DW will obviously become a bottleneck.

Proper Data Governance

Proper Data Governance becomes a critical point in Data Mesh practice: a centrally governed topology, taxonomy and catalog, technology and data patterns (policies), universal IDs, and shared terminology and definitions will play the same role in the Data Mesh concept that “conformed definitions and dimensions” play in the DW concept.

Thus, there should be a dedicated team, governed perhaps by the CDO. There should also be some kind of certification of whether a provided data set is consumable, secure, of appropriate quality, and so on. As Data Mesh suggests, data strategy requires top-down buy-in and bottom-up ownership.

A successful implementation of the Data Mesh philosophy should accommodate diverse needs through a balanced list of standards, such as technology blueprint approval and certified/not-certified physical or logical areas (some objects tagged as certified).

Ecosystem Governance

Ecosystem Governance: to ensure business owners can trust and share their data products, an enterprise data governance team should implement access controls, cataloging, and compliance policies across the distributed domains. This team examines each point in the creation of the data product – whether the data can be trusted, whether the data owners applied the right constraints on usage, and so on – with appropriate tagging and an action plan. They also need to maintain a common glossary to minimize the ever-present risk of language barriers between business units and to distribute data products intelligently between Data Domains.

To enable cross-domain collaboration, the Data Mesh must standardize on formatting, governance, discoverability, and metadata fields, among other data features. With a central governance team this task is well understood by the organization – separation of duties, qualified management, a professional team, etc. But how do we manage such tasks when we have dozens of teams with very different visions and areas of expertise? How do we assign responsibilities to avoid work duplication and orphaned tasks? How do we resolve the conflicts?
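
As an illustration of what automated enforcement of such a metadata standard could look like, here is a minimal sketch; the required field names are invented for this example:

```python
# Hypothetical metadata fields the enterprise governance team requires from every data product
REQUIRED_FIELDS = {"owner_domain", "schema", "freshness_sla_hours", "classification", "glossary_terms"}
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}

def validate_product_metadata(metadata: dict) -> list:
    """Return a list of problems; an empty list means the product passes the standard."""
    problems = ["missing field: " + f for f in sorted(REQUIRED_FIELDS - metadata.keys())]
    if "classification" in metadata and metadata["classification"] not in ALLOWED_CLASSIFICATIONS:
        problems.append("unknown classification value")
    return problems

# Example: a domain publishes a product with incomplete metadata
issues = validate_product_metadata({"owner_domain": "sales", "schema": {"order_id": "string"}})
print(issues)  # ['missing field: classification', 'missing field: freshness_sla_hours', 'missing field: glossary_terms']
```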

Distributed/Democratic vs Centralized/Dictated

The core idea of distributed systems is a democratic approach, in contrast to traditional army-style centralized structures. The idea of Data Mesh is to keep decision making as local (democratic) as possible. The domain team ingests operational data and builds analytical data models to perform its analysis. It uses analytical data to create data products based on other domains’ needs. Each data domain must define and agree on SLAs and quality measures that it will “guarantee” to its consumers. An individual Data Product works in a similar way to an individual microservice, but with an appropriate, standardized interface; otherwise effective cross-domain interaction becomes hardly possible.
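
Here is a minimal sketch of what such a declared and automatically checked SLA/quality contract might look like; the names and thresholds are illustrative only:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataProductSLA:
    """Hypothetical contract a producing domain 'guarantees' to its consumers."""
    max_staleness: timedelta   # how old the latest load is allowed to be
    min_completeness: float    # minimum fraction of non-null values in key columns

def check_sla(sla: DataProductSLA, last_loaded_at: datetime, completeness: float) -> list:
    """Return a list of violations against the declared contract."""
    violations = []
    if datetime.utcnow() - last_loaded_at > sla.max_staleness:
        violations.append("data is staler than the guaranteed freshness")
    if completeness < sla.min_completeness:
        violations.append("completeness is below the guaranteed threshold")
    return violations

sla = DataProductSLA(max_staleness=timedelta(hours=24), min_completeness=0.99)
print(check_sla(sla, last_loaded_at=datetime.utcnow() - timedelta(hours=30), completeness=0.995))
# ['data is staler than the guaranteed freshness']
```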

How can distributed systems (not only the Mesh) work as a synergetic whole, rather than a collection of separate domains that will often conflict with each other?

It makes sense to suggest three steps, similar to how successful democratic structures (like countries) work, where the next step is taken only if the previous one does not resolve the issue.

  1. Negotiating – the nicest way of doing things, when the parties come to a common solution acceptable to everyone.
  2. Voting between equals is a way to resolve cross-domain discrepancies when negotiating does not bring a result. Then a quorum of authorized teams is a good solution. A quorum can easily work automatically through an appropriate voting interface, with the result linked to an in-force decision. Quorum voting does not necessarily mean equal votes; some participants may carry more weight than others (see the weighted-voting sketch after this list).
  3. As the last step, “the federal government” steps in. For example, the central team has to have in its mandate the right to block data distribution because of poor data quality (e.g., IDs do not match), unacceptable data integration, or security problems, but this should be an exception rather than common practice.
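
Here is a minimal sketch of the weighted quorum idea from step 2; the domain names, weights, and the two-thirds threshold are illustrative only:

```python
# Hypothetical voting weights agreed between the domains
WEIGHTS = {"sales": 2.0, "marketing": 1.0, "finance": 2.0, "logistics": 1.0}

def quorum_decision(votes: dict, threshold: float = 2 / 3) -> bool:
    """Return True if the weighted share of 'yes' votes among the votes cast meets the threshold."""
    total = sum(WEIGHTS[domain] for domain in votes)
    yes = sum(WEIGHTS[domain] for domain, vote in votes.items() if vote)
    return total > 0 and yes / total >= threshold

# Example: sales and finance approve a schema change, marketing objects, logistics abstains
print(quorum_decision({"sales": True, "finance": True, "marketing": False}))  # True (4.0 of 5.0)
```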

Automation, automation, automation

A very important enabler for such a distributed system, one that should help a lot with the tasks above, is AI and collective decision-making systems. They should automate task assignment, tracking, linking, looking for analogies, calls for voting, etc. This is not software or a platform that makes decisions for a human. It is a system that helps a human collaborate with the AI and with other humans through an AI interface. This is not just a simple sum of the human skills in a team and an AI capability; it is an integral structure with a very different, more collaborative, and more capable culture. We are not there yet, but just a couple of years ago we could not have guessed what Generative AI would be capable of, and this is the future.

We should keep in mind that Data Mesh is intended to introduce fewer hops and simpler data processing than the traditional approach. The Mesh should decrease data bureaucracy, not increase it.

Proper use of AI/ML approaches and automation tools is critical for this new concept in multiple areas: finding gaps in data integrity, data quality, and lineage, discovering new rules and patterns, and so on. These tools target both the local and Enterprise levels.

A modern Enterprise ecosystem planned to sit under Mesh domains includes all types of application DBs, Lakes (Hadoop- or S3-based), on-premises and distributed DWs, and Lakehouses, with some shared file system areas (NAS or its Cloud substitute).

To address the problem of duplicated effort, the Data Mesh gathers the domain-agnostic data infrastructure capabilities into a single platform that handles the core processes to monitor and manage our data sets regardless of where they reside:

  • Data product schema
  • Data product versioning
  • Data discovery, cataloging, and product publishing
  • Data governance and standardization
  • Data lineage
  • Data product quality metrics
  • Data product monitoring, alerting, and logging
  • Data pipeline engines
  • Encryption for data at rest and in motion
  • Data storage
  • Data ingestion architecture

Thus, the Data Mesh architecture standardizes data management practices across the Enterprise platforms. It is supposed to provide Enterprise-level data visibility and insights, data access and control, and data protection and security. That is what the Data Mesh concept was originally developed for. But it is not easy. Data Mesh is actually the most complicated and demanding Data Architecture of them all.
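
To close, here is a minimal sketch of how a domain might register a product with such a shared platform, touching a few of the capabilities listed above (cataloging, versioning, lineage); the platform class and its methods are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class SelfServePlatform:
    """Hypothetical domain-agnostic registry covering a few capabilities from the list above."""
    catalog: dict = field(default_factory=dict)   # data discovery and product publishing
    versions: dict = field(default_factory=dict)  # data product versioning
    lineage: dict = field(default_factory=dict)   # upstream dependencies per product

    def publish(self, name: str, schema: dict, version: str, upstream: list) -> None:
        """Register a data product's schema, version, and lineage in one step."""
        self.catalog[name] = schema
        self.versions.setdefault(name, []).append(version)
        self.lineage[name] = upstream

platform = SelfServePlatform()
platform.publish(
    name="sales.orders_daily",
    schema={"order_id": "string", "amount": "decimal", "order_date": "date"},
    version="1.2.0",
    upstream=["erp.orders_raw"],
)
print(list(platform.catalog), platform.versions["sales.orders_daily"])
```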
