Evolving Data Architecture Patterns – Data Fabric & Data Mesh
When the hype around Big Data and the 3Vs of data started, most organizations set out to collect data from across the enterprise and establish a Data Lake. Research firms and technologists convinced the world that data was living in silos and that organizations needed to collect all of it in a Data Lake as the single source of truth. Many organizations failed in that pursuit for several reasons: lack of skilled resources, a new and broad technology landscape, lack of data governance and policies, and so on. Even where the Data Lake was successfully established, most failed to tap into it and derive insights. In other words, the data journey turned out to be an expensive affair, with heavy investment in software licenses, hardware, and teams spanning a wide range of skills. The justification given now is that with so much data from different domains in a single place, it is very hard to tap into, so we should go for a federated Data Lake. The funniest claim I have heard is that "the Data Lake is not keeping its promise." But doesn't this approach take us back to data silos? Sounds like a familiar paradox?
So what went wrong, especially in cases where a Data Lake was successfully established by extracting data from the source systems? Why exactly couldn't organizations tap into the data? The main reason is a lack of Data Democratization, which would let end users seamlessly access the data they want without depending on data engineers. Other reasons, varying with the implementation approach, are data quality, data governance, data integrity, and so on. In other words, many organizations failed, or simply neglected, to control the data while they were busy pushing it into the Data Lake. So what is the solution to these problems?
The solution needs to address multiple aspects: Data Democratization, automation, Data Governance, Data as a Service/Product, and so on. The idea is to make quality data available to everyone on demand by eliminating or minimizing dependencies on IT and data engineering teams.
The following two methodologies and architecture approaches are gaining popularity to address this:
1. Data Mesh: The Data Mesh architecture is based on Domain-Driven Design and aims to deliver Data as a Product (DAAP). The idea is to give domain teams the ownership of, and onus for, building and governing their data products, and to expose a service that serves each data product to other domains, a concept called Data as a Service (DAAS). All data need not sit in a single Data Lake; each domain keeps its own set of data stores such as object storage, databases, data warehouses, Data Lakes, etc. In other words, Data Mesh relies on the concept of federated data: rather than viewing enterprise data as one huge repository, it treats it as a set of repositories of data products.

[Diagram: Data Mesh logical architecture]
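To make the Data-as-a-Product idea concrete, here is a minimal, hypothetical sketch in Python of a domain-owned data product contract plus a federated registry. The names (DataProduct, owner_domain, fetch, DataProductRegistry) are illustrative assumptions, not the API of any particular Data Mesh tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch: in a Data Mesh, each domain team owns,
# documents, and serves its own data product behind a contract.

@dataclass
class DataProduct:
    name: str                        # e.g. "sales.orders_by_region"
    owner_domain: str                # the domain team accountable for it
    schema: Dict[str, str]           # published contract: column -> type
    fetch: Callable[[], List[dict]]  # output port: serves the data on demand

class DataProductRegistry:
    """Federated catalog: domains register products; consumers discover them."""
    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        self._products[product.name] = product

    def get(self, name: str) -> DataProduct:
        return self._products[name]

# Usage: the Sales domain exposes its product; another domain consumes it
# without knowing where or how the underlying data is stored.
registry = DataProductRegistry()
registry.register(DataProduct(
    name="sales.orders_by_region",
    owner_domain="sales",
    schema={"region": "string", "order_count": "int"},
    fetch=lambda: [{"region": "EMEA", "order_count": 1200}],
))
rows = registry.get("sales.orders_by_region").fetch()
```

The point of the sketch is the ownership boundary: the consumer depends only on the published schema and output port, never on the owning domain's internal storage.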
2. Data Fabric: The Data Fabric pattern emphasizes building a knowledge graph of metadata that holds the relationships between data sources. Machine learning and emerging technologies such as semantic knowledge graphs and active metadata management are aimed at enabling the data fabric architecture. The pattern also relies on Data Virtualization, a concept that does not require ingesting data beforehand but accesses it dynamically via the metadata store, using techniques like caching and push-down query optimization. Examples of Data Fabric tools are DataFlex, Atlan, Cinchy, data.world, Denodo, K2View, IBM Cloud Pak, etc.

[Diagram: Data Fabric architecture]
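To illustrate the virtualization idea, here is a minimal, hypothetical Python sketch of an access layer that queries registered sources on demand, pushing the filter predicate down to each source and caching results. The class and method names (VirtualizationLayer, register_source, query) are assumptions for illustration; real tools in this space are far more sophisticated.

```python
import time
from typing import Callable, Dict, List, Tuple

Row = dict

class VirtualizationLayer:
    """Serves queries against registered sources without pre-ingesting data."""
    def __init__(self, ttl_seconds: float = 300.0) -> None:
        # Each source is a function that runs a filtered query directly
        # against its own store (push-down), so only matching rows move.
        self._sources: Dict[str, Callable[[str], List[Row]]] = {}
        self._cache: Dict[Tuple[str, str], Tuple[float, List[Row]]] = {}
        self._ttl = ttl_seconds

    def register_source(self, name: str, query_fn: Callable[[str], List[Row]]) -> None:
        self._sources[name] = query_fn

    def query(self, source: str, predicate: str) -> List[Row]:
        key = (source, predicate)
        hit = self._cache.get(key)
        if hit and hit[0] > time.time():           # repeated query: serve from cache
            return hit[1]
        rows = self._sources[source](predicate)    # push the predicate down
        self._cache[key] = (time.time() + self._ttl, rows)
        return rows

# Usage: "crm" data stays in its own store; we access it on demand.
layer = VirtualizationLayer()
layer.register_source("crm", lambda pred: [{"customer": "Acme", "tier": "gold"}])
print(layer.query("crm", "tier = 'gold'"))
```

Note the trade-off hinted at here: the first query always pays the full round trip to the source, which is exactly where the performance concerns raised below come from.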
The question is: are Data Mesh, Data Fabric, and the Data Lake replacements for each other? Not really. What guarantees that decentralized, domain-wise governance as prescribed by Data Mesh won't pose other kinds of challenges? For example, it might require separate teams of data scientists and data engineers in each domain or business unit, adding cost. What guarantees that the Data Virtualization approach, accessing data on demand without moving it, won't pose performance or other challenges? What guarantees that the ML-based automated knowledge graph at the heart of Data Fabric will ensure data integrity? It remains to be seen whether these new data management approaches add business value or merely complexity.

The choice of data architecture should depend on a number of factors within an organization: the number of domains or business units; the volume, types, and sensitivity of data; workload types (analytical or transactional); organizational policies; SLAs; industry regulations; use cases; the current technology landscape; and so on. For example, a use case like Customer 360 might need data from different domains with curation and aggregation at many levels; there it makes sense to establish a Data Lake. You can also have segregated domain-wise buckets within the Data Lake, governed by the respective domains, with domain-wise microservices to access the data. Or you can have a hybrid encompassing all the data architecture patterns. Haven't we been doing this for decades: combining multiple design patterns in a solution architecture to solve specific business problems and use cases?

In summary, Data Mesh is more of an organizational change than an architecture change, while Data Fabric is an architecture change at the core. Each has its own pros and cons and poses new challenges. Do your due diligence to assess the fitment of the data architecture patterns in your organization before adopting one, for example with a weighted scoring across selection parameters, as sketched below. Be judicious in choosing the data architecture pattern(s), or data can become an untamed BEAST.
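As a sketch of that weighted-scoring idea, the criteria, weights, and 1-5 scores below are purely illustrative placeholders; each organization would substitute its own selection parameters and honest assessments.

```python
# Hypothetical weighted scoring for comparing candidate data
# architecture patterns. All numbers are illustrative, not benchmarks.

weights = {
    "domain_autonomy_needed": 0.25,
    "cross_domain_use_cases": 0.25,   # e.g. Customer 360
    "data_sensitivity":       0.20,
    "team_skill_coverage":    0.15,
    "current_tech_landscape": 0.15,
}

# Score each pattern 1-5 against each criterion (placeholder values).
candidates = {
    "Data Lake":   {"domain_autonomy_needed": 2, "cross_domain_use_cases": 5,
                    "data_sensitivity": 3, "team_skill_coverage": 4,
                    "current_tech_landscape": 4},
    "Data Mesh":   {"domain_autonomy_needed": 5, "cross_domain_use_cases": 2,
                    "data_sensitivity": 4, "team_skill_coverage": 2,
                    "current_tech_landscape": 3},
    "Data Fabric": {"domain_autonomy_needed": 3, "cross_domain_use_cases": 4,
                    "data_sensitivity": 4, "team_skill_coverage": 3,
                    "current_tech_landscape": 2},
}

# Weighted total per pattern; the highest score is a conversation
# starter for due diligence, not an automatic decision.
for name, scores in candidates.items():
    total = sum(weights[c] * scores[c] for c in weights)
    print(f"{name}: {total:.2f}")
```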
Director, Supply Chain Operations Consulting (3y):
Good article Abhishek. Many of the "data initiative" failures are indeed associated with data quality, data governance, and data integrity. However, these are all systemic components and, with guidance, can be aligned and implemented correctly from day one. With the support of a strategic, vendor-agnostic data team that includes all the relevant skills and capabilities, a simple yet effective stepwise approach can make a significant and meaningful impact on the expected outcome. It further reveals the emphasis required on architecture or organisational change, and to what extent data needs to be democratised and personalised. Five high-level steps to follow:
1) Discovery: due diligence to determine what, why, when, where, who, and how
2) Data preparation
3) Data unification and curation
4) Delivery
5) Data consumption, aligned with the business objectives, digital strategy, and action plans defined during discovery