Episode 3 - Data Mesh Principles: Domain-Driven Architecture and Product Thinking
Introduction:
I am Beshoy Gamal, a Big Data and Machine Learning geek. I have worked on implementing data-driven solutions for more than 9 years, across countries and across technologies, from on-premises to cloud, and I am now working at Vodafone Group as a Senior Data Architect.
From all my experience, I have found that many organizations invest in a central data lake and a central data team with the expectation of driving their business based on data. However, after a few initial quick wins, they notice that the central data team often becomes a bottleneck, as it cannot handle all the analytical questions of management and product owners quickly enough.
So I have decided to write this series of articles about Data Mesh, Data Products, Self-Service, and Data Democratization.
Data and distributed domain-driven architecture convergence
In the previous episode we studied the evolution of the Data Mesh from a historical perspective, on three layers:
1- The Evolution of Data Analytics
2- The Evolution of Data Domains
3- The Evolution of Data Teams
Now we will take a deeper look at Domain-Driven Architecture and Product Thinking.
Domain-oriented data decomposition and ownership
Eric Evans's book Domain-Driven Design has deeply influenced modern architectural thinking, and consequently organizational modeling. It has influenced the microservices architecture by decomposing systems into distributed services built around business domain capabilities. It has fundamentally changed how teams form, so that a team can independently and autonomously own a domain capability.
Though we have adopted domain-oriented decomposition and ownership when implementing operational capabilities, curiously we have disregarded the notion of business domains when it comes to data. The closest application of DDD in data platform architecture is for source operational systems to emit their business Domain Events and for the monolithic data platform to ingest them. However, beyond the point of ingestion the concept of domains and the ownership of the domain data by different teams is lost.
Domain Bounded Context is a wonderfully powerful tool to design the ownership of the datasets. Ben Stopford's Data Dichotomy article unpacks the concept of sharing domain datasets through streams.
In order to decentralize the monolithic data platform, we need to reverse how we think about data, its locality and ownership. Instead of flowing the data from domains into a centrally owned data lake or platform, domains need to host and serve their domain datasets in an easily consumable way.
In our example, instead of imagining data flowing from media players into some sort of centralized place for a centralized team to receive, why not imagine a player domain owning and serving its datasets for access by any team, for any purpose, downstream. The physical location where the datasets actually reside, and how they flow, is a technical implementation detail of the 'player domain'. The physical storage could certainly be a centralized infrastructure such as Amazon S3 buckets, but the player datasets' content and ownership remain with the domain generating them. Similarly, in our example, the 'recommendations' domain creates datasets in a format that is suitable for its application, such as a graph database, while consuming the player datasets. If there are other domains, such as a 'new artist discovery' domain, which find the 'recommendations' domain graph dataset useful, they can choose to pull and access it.
This implies that we may duplicate data in different domains as we transform it into a shape that is suitable for that particular domain, e.g. a time series of play events into a related-artists graph.
This requires shifting our thinking from a push and ingest model, traditionally through ETLs and more recently through event streams, to a serve and pull model across all domains.
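To make the serve-and-pull model concrete, here is a minimal sketch in Python, assuming purely hypothetical names (PlayerDomainData, the S3 prefix, the event fields): the player domain owns and serves its dataset through a read interface, and a downstream domain pulls and reshapes it for its own needs.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class PlayEvent:
    user_id: str
    track_id: str
    played_at: str  # ISO-8601 timestamp


class PlayerDomainData:
    """Owned and operated by the player domain team."""

    # The physical location is an implementation detail of the domain; it could
    # be an S3 prefix, a Kafka topic, a database table, etc.
    storage_uri = "s3://media-player-domain/play-events/"

    def read_play_events(self, since: str) -> Iterator[PlayEvent]:
        """Downstream domains pull the events they need; storage stays hidden."""
        return iter([])  # placeholder for the domain-internal implementation


# A consuming domain (e.g. 'recommendations') pulls the data and reshapes it
# into a form that suits its own access model, such as a graph.
def build_related_artists_graph(player_data: PlayerDomainData) -> None:
    for event in player_data.read_play_events(since="2024-01-01T00:00:00Z"):
        pass  # e.g. add user -> track -> artist edges to a graph store
```

The point of the sketch is the direction of the dependency: consumers reach out to the domain's published interface, rather than the data being pushed into a central platform.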
The architectural quantum in a domain-oriented data platform is a domain and not the pipeline stage.
Source-oriented domain data
Some domains naturally align with the source, where the data originates. The source domain datasets represent the facts and reality of the business. They capture the data that maps very closely to what the operational systems of their origin, the systems of reality, generate. In our example, facts of the business such as 'how the users are interacting with the services', or 'the process of onboarding labels', lead to the creation of domain datasets such as 'user click streams', 'audio play quality stream' and 'onboarded labels'. These facts are best known and generated by the operational systems that sit at the point of origin. For example, the media player system knows best about the 'user click streams'.
In a mature and ideal situation, an operational system and its team or organizational unit are not only responsible for providing business capabilities but also responsible for providing the truths of their business domain as source domain datasets. At enterprise scale there is never a one-to-one mapping between a domain concept and a source system. There are often many systems that can serve parts of the data that belongs to a domain, some legacy and some easy to change. Hence there might be many source-aligned datasets, aka reality datasets, that ultimately need to be aggregated into a cohesive domain-aligned dataset.
The business facts are best presented as business Domain Events, and can be stored and served as distributed logs of time-stamped events for any authorized consumer to access.
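As a small illustration, and assuming purely hypothetical field names, a source domain event could look like the record below: an immutable, time-stamped fact that would be appended to a distributed log (for instance a topic per event type) for authorized consumers to replay.

```python
import json
from datetime import datetime, timezone

# A hypothetical 'audio play' domain event emitted by the media player system.
# All field names and values are illustrative assumptions.
play_event = {
    "event_type": "audio_play",
    "event_id": "e-5f0c1a2b",  # unique id, handy for de-duplication downstream
    "occurred_at": datetime.now(timezone.utc).isoformat(),
    "user_id": "user-42",
    "track_id": "track-1001",
    "play_quality": "high",
}

# In practice this record would be appended to an append-only, distributed log
# that any authorized consumer can read from its own offset.
serialized = json.dumps(play_event).encode("utf-8")
```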
In addition to timed events, source data domains should also provide easily consumable historical snapshots of the source domain datasets, aggregated over a time interval that closely reflects the interval of change for their domain. For example, in an 'onboarded labels' source domain, which shows the labels of the artists that provide music to the streaming business, aggregating the onboarded labels on a monthly basis is a reasonable view to provide in addition to the events generated through the process of onboarding labels.
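A minimal sketch of such a snapshot, assuming illustrative event fields, could simply fold the onboarding events into a per-month view:

```python
from collections import Counter
from datetime import datetime

# Hypothetical 'label onboarded' events from the source domain.
onboarding_events = [
    {"label_id": "label-1", "onboarded_at": "2024-01-15T10:00:00"},
    {"label_id": "label-2", "onboarded_at": "2024-01-20T09:30:00"},
    {"label_id": "label-3", "onboarded_at": "2024-02-02T14:45:00"},
]

def monthly_onboarded_labels(events):
    """Aggregate onboarding events into a monthly historical snapshot."""
    per_month = Counter()
    for event in events:
        month = datetime.fromisoformat(event["onboarded_at"]).strftime("%Y-%m")
        per_month[month] += 1
    return dict(per_month)

print(monthly_onboarded_labels(onboarding_events))  # {'2024-01': 2, '2024-02': 1}
```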
Note that the source-aligned domain datasets must be separated from the internal source systems' datasets. The nature of the domain datasets is very different from the internal data that the operational systems use to do their job. They have a much larger volume, represent immutable timed facts, and change less frequently than their systems. For this reason the actual underlying storage must be suitable for big data, and separate from the existing operational databases. The section Data and self-serve platform design convergence describes how to create big data storage and serving infrastructure.
Source domain datasets are the most foundational datasets and change less often, as the facts of the business don't change that frequently. These domain datasets are expected to be permanently captured and made available, so that as the organization evolves its data-driven and intelligence services, it can always go back to the business facts and create new aggregations or projections.
Note that source domain datasets represent closely the raw data at the point of creation, and are not fitted or modeled for a particular consumer.
Consumer oriented and shared domain data
Some domains align closely with the consumption. The consumer domain datasets, and the teams who own them, aim to satisfy a closely related group of use cases. For example, the 'social recommendation' domain, which focuses on providing recommendations based on users' social connections to each other, creates domain datasets that fit this specific need; perhaps through a 'graph representation of the social network of users'. While this graph dataset is useful for the recommendation use case, it might also be useful for a 'listeners notifications' domain, which provides data regarding the different types of notifications that are sent to the listener, including what people in their social network are listening to. So it is possible that 'user social network' can become a shared and newly reified domain dataset for multiple consumers to use. The 'user social network' domain team focuses on providing an always curated and up-to-date view of the 'user social network'.
The consumer-aligned domain datasets have a different nature in comparison to source domain datasets. They structurally go through more changes, and they transform the source domain events into aggregate views and structures that fit a particular access model, such as the graph example we saw above. A domain-oriented data platform should be able to easily regenerate these consumer datasets from the source.
Distributed pipelines as domain internal implementation
While the ownership of the datasets is delegated from the central platform to the domains, the need for cleansing, preparing, aggregating and serving data remains, and so does the usage of data pipelines. In this architecture, a data pipeline is simply an internal complexity and implementation of the data domain and is handled internally within the domain. As a result we will see a distribution of the data pipeline stages into each domain.
For example, the source domains need to include the cleansing, deduplicating and enriching of their domain events so that they can be consumed by other domains without replication of cleansing. Each domain dataset must establish Service Level Objectives for the quality of the data it provides: timeliness, error rates, etc. For example, our media player domain providing the audio 'play clickstream' can include a cleansing and standardizing data pipeline in their domain that provides a stream of de-duped near-real-time 'play audio click events' that conform to the organization's standards of encoding events.
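As a rough sketch, and with field names that are only assumptions, the cleansing step inside the player domain could de-duplicate events by id and normalize fields before they are served to other domains:

```python
# Domain-internal cleansing: de-duplicate by event id and standardize fields
# to the organization's event-encoding conventions (illustrative only).
def cleanse_play_events(raw_events):
    seen_ids = set()
    for event in raw_events:
        if event["event_id"] in seen_ids:
            continue  # drop duplicates so consumers don't have to
        seen_ids.add(event["event_id"])
        yield {
            "event_id": event["event_id"],
            "occurred_at": event["occurred_at"],          # ISO-8601 expected
            "user_id": event["user_id"].strip().lower(),  # standardized encoding
            "track_id": event["track_id"],
        }
```

Whether the associated SLOs (de-duplication rate, timeliness, etc.) are actually met would then be measured and published by the domain itself.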
Equally, we will see that aggregation stages of a centralized pipeline move into implementation details of consuming domains.
One might argue that this model might lead to duplicated effort in each domain to create their own data processing pipeline implementation, technology stack and tooling. I will address this concern shortly as we talk about the Convergence of Data and Platform Thinking with Self-serve shared Data Infrastructure as a Platform.
Data and product thinking convergence
Distribution of the data ownership and data pipeline implementation into the hands of the business domains raises an important concern around accessibility, usability and harmonization of distributed datasets. This is where the learning from applying product thinking to the ownership of data assets comes in handy.
Domain data as a product
Over the last decade, operational domains have built product thinking into the capabilities they provide to the rest of the organization. Domain teams provide these capabilities as APIs to the rest of the developers in the organization, as building blocks for creating higher-order value and functionality. The teams strive to create the best developer experience for their domain APIs, including discoverable and understandable API documentation, API test sandboxes, and closely tracked quality and adoption KPIs.
For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide, considering their data assets as their products and the rest of the organization's data scientists, ML and data engineers as their customers.
Consider our example, an internet media streaming business. One of its critical domains is 'play events': what songs have been played by whom, when and where. This key domain has different consumers in the organization; for example, near-real-time consumers that are interested in the experience of the user and possible errors, so that in the case of a degraded customer experience or an incoming customer support call they can respond quickly to recover from the error. There are also a few consumers that would prefer the historical snapshots of the daily or monthly song play event aggregates.
In this case our 'played songs' domain provides two different datasets as its products to the rest of the organization: real-time play events exposed on event streams, and aggregated play events exposed as serialized files on an object store.
An important quality of any technical product, in this case domain data products, is to delight their consumers; in this case data engineers, ML engineers or data scientists. To provide the best user experience for consumers, the domain data products need to have the following basic qualities:
Discoverable
A data product must be easily discoverable. A common implementation is to have a registry, a data catalogue, of all available data products with their meta information such as their owners, source of origin, lineage, sample datasets, etc. This centralized discoverability service allows data consumers, engineers and scientists in an organization to find a dataset of their interest easily. Each domain data product must register itself with this centralized data catalogue for easy discoverability.
Note the perspective shift here is from a single platform extracting and owning the data for its use, to each domain providing its data as a product in a discoverable fashion.
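A minimal sketch of such a registration, assuming a hypothetical catalogue and metadata fields, might look like this:

```python
# A stand-in for the centralized data catalogue; the API and the metadata
# fields below are illustrative assumptions, not a specific product.
catalogue = {}

def register_data_product(entry: dict) -> None:
    catalogue[f"{entry['domain']}/{entry['name']}"] = entry

register_data_product({
    "name": "play-events",
    "domain": "media-player",
    "owner": "player-domain-team@example.com",
    "source_of_origin": "media player operational system",
    "lineage": ["media-player.raw-clickstream"],
    "sample_data_uri": "s3://media-player-domain/play-events/samples/",
    "documentation": "https://wiki.example.com/data-products/play-events",
})
```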
Addressable
A data product, once discovered, should have a unique address following a global convention that helps its users to programmatically access it. Organizations may adopt different naming conventions for their data, depending on the underlying storage and format of the data. Considering ease of use as an objective, in a decentralized architecture it is necessary for common conventions to be developed. Different domains might store and serve their datasets in different formats: events might be stored and accessed through streams such as Kafka topics, columnar datasets might use CSV files or AWS S3 buckets of serialized Parquet files. A standard for the addressability of datasets in a polyglot environment removes friction when finding and accessing information.
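As an illustration only, one possible convention could derive a stable, storage-agnostic address from the domain and product names (the URI scheme below is an assumption):

```python
def data_product_address(domain: str, product: str, version: str = "v1") -> str:
    """A convention-based unique address, independent of the storage format."""
    return f"dataproduct://{domain}/{product}/{version}"

# The same convention works whether the product is backed by a Kafka topic,
# CSV files, or Parquet files in an S3 bucket.
stream_address = data_product_address("media-player", "play-events")
snapshot_address = data_product_address("media-player", "play-events-monthly")
```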
Trustworthy and truthful
No one will use a product that they can't trust. In the traditional data platforms it's acceptable to extract and onboard data that has errors, does not reflect the truth of the business and simply can't be trusted. This is where the majority of the efforts of centralized data pipelines are concentrated: cleansing data after ingestion.
A fundamental shift requires the owners of the data products to provide an acceptable Service Level Objective around the truthfulness of the data, and how closely it reflects the reality of the events that have occurred or the high probability of the truthfulness of the insights that have been generated. Applying data cleansing and automated data integrity testing at the point of creation of the data product are some of the techniques to be utilized to provide an acceptable level of quality. Providing data provenance and data lineage as the metadata associated with each data product helps consumers gain further confidence in the data product and its suitability for their particular needs.
The target value or range of a data integrity (quality) indicator varies between domain data products. For example, the 'play event' domain may provide two different data products, one near-real-time with a lower level of accuracy, including missing or duplicate events, and one with a longer delay and a higher level of event accuracy. Each data product defines and assures the target level of its integrity and truthfulness as a set of SLOs.
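A sketch of how the 'play event' domain could declare these SLOs for its two products, with thresholds that are purely illustrative:

```python
# Illustrative SLO targets for the two hypothetical 'play event' products.
play_event_slos = {
    "play-events-realtime": {
        "timeliness_seconds": 60,      # available within a minute of occurrence
        "max_duplicate_rate": 0.02,    # some duplicates tolerated for low latency
        "max_missing_rate": 0.01,
    },
    "play-events-daily": {
        "timeliness_seconds": 24 * 3600,  # next-day availability
        "max_duplicate_rate": 0.0,        # fully de-duplicated
        "max_missing_rate": 0.0001,
    },
}

def meets_slo(product: str, observed: dict) -> bool:
    """Check observed quality indicators against the product's targets."""
    targets = play_event_slos[product]
    return all(observed[name] <= target for name, target in targets.items())
```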
Self-describing semantics and syntax
Quality products require no consumer hand-holding to be used: they can be independently discovered, understood and consumed. Building datasets as products with minimum friction for the data engineers and data scientists to use requires well-described semantics and syntax of the data, ideally accompanied by sample datasets as exemplars. Data schemas are a starting point to provide self-serve data assets.
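For illustration, a self-describing schema plus an exemplar record could be shipped with the product; the representation below (a plain dictionary) and the field names are assumptions:

```python
# A self-describing schema published alongside the 'play events' data product.
play_events_schema = {
    "fields": [
        {"name": "event_id", "type": "string", "description": "Unique event identifier"},
        {"name": "occurred_at", "type": "timestamp", "description": "ISO-8601, UTC"},
        {"name": "user_id", "type": "string", "description": "Listener identifier"},
        {"name": "track_id", "type": "string", "description": "Played track identifier"},
    ],
}

# An exemplar record published with the schema helps consumers self-serve.
sample_record = {
    "event_id": "e-0001",
    "occurred_at": "2024-03-01T12:00:00+00:00",
    "user_id": "user-42",
    "track_id": "track-1001",
}
```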
Inter-operable and governed by global standards
One of the main concerns in a distributed domain data architecture is the ability to correlate data across domains and stitch them together in wonderful, insightful ways: join, filter, aggregate, etc. The key for an effective correlation of data across domains is following certain standards and harmonization rules. Such standardizations should belong to a global governance, to enable interoperability between polyglot domain datasets. Common concerns of such standardization efforts are field type formatting, identifying polysemes across different domains, dataset address conventions, common metadata fields, event formats such as CloudEvents, etc.
For example, in the media streaming business, an 'artist' might appear in different domains and have different attributes and identifiers in each domain. The 'play eventstream' domain may recognize the artist differently to the 'artists payment' domain that takes care of invoices and payments. However, to be able to correlate the data about an artist across different domain data products, we need to agree on how we identify an artist as a polyseme. One approach is to consider 'artist' as a federated entity with a unique global federated entity identifier, similarly to how federated identities are managed.
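A small sketch of that idea, with identifiers that are pure assumptions: each domain keeps its local artist id, and a global federated id ties them together so data products can be joined across domains.

```python
from typing import Optional

# Illustrative federated identity records for the 'artist' polyseme.
artist_registry = [
    {
        "global_artist_id": "artist-7f3a",          # agreed across the organization
        "local_ids": {
            "play-eventstream": "plr-artist-991",   # id used in the player domain
            "artists-payment": "PAY-ART-00042",     # id used in the payments domain
        },
    },
]

def resolve_global_artist_id(domain: str, local_id: str) -> Optional[str]:
    """Map a domain-local identifier to the global federated one."""
    for entry in artist_registry:
        if entry["local_ids"].get(domain) == local_id:
            return entry["global_artist_id"]
    return None

assert resolve_global_artist_id("artists-payment", "PAY-ART-00042") == "artist-7f3a"
```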
Interoperability and standardization of communications, governed globally, is one of the foundational pillars for building distributed systems.
Secure and governed by a global access control
Accessing product datasets securely is a must, whether the architecture is centralized or not. In the world of decentralized domain-oriented data products, access control is applied at a finer granularity, for each domain data product. Similarly to operational domains, the access control policies can be defined centrally but applied at the time of access to each individual dataset product. Using the Enterprise Identity Management system (SSO) and Role-Based Access Control policy definitions is a convenient way to implement product dataset access control.
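A minimal sketch of centrally defined, per-product role-based access control, with roles and product names that are only assumptions:

```python
# Centrally defined policies, evaluated at access time per data product.
access_policies = {
    "media-player/play-events": {"roles_allowed": {"data-scientist", "recommendations-service"}},
    "artists-payment/invoices": {"roles_allowed": {"finance-analyst"}},
}

def can_access(user_roles: set, data_product: str) -> bool:
    """Applied at the time of access, for each individual data product."""
    policy = access_policies.get(data_product)
    return bool(policy) and bool(user_roles & policy["roles_allowed"])

assert can_access({"data-scientist"}, "media-player/play-events")
assert not can_access({"data-scientist"}, "artists-payment/invoices")
```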
The section Data and self-serve platform design convergence describes the shared infrastructure that enables the above capabilities for each data product easily and automatically.
Domain data cross-functional teams
Domains that provide data as products need to be augmented with new skill sets: (a) the data product owner and (b) data engineers.
A data product owner makes decisions around the vision and the roadmap for the data products, concerns herself with the satisfaction of her consumers, and continuously measures and improves the quality and richness of the data her domain owns and produces. She is responsible for the lifecycle of the domain datasets: when to change, revise and retire data and schemas. She strikes a balance between the competing needs of the domain data consumers.
Data product owners must define success criteria and business-aligned Key Performance Indicators (KPIs) for their data products. For example, the lead time for consumers of a data product to discover and use the data product successfully is a measurable success criterion.
In order to build and operate the internal data pipelines of the domains, teams must include data engineers. A wonderful side effect of such a cross-functional team is the cross-pollination of different skills. My current industry observation is that some data engineers, while competent in using the tools of their trade, lack software engineering standard practices, such as continuous delivery and automated testing, when it comes to building data assets. Similarly, software engineers who are building operational systems often have no experience utilizing data engineering tool sets. Removing the skill-set silos will lead to the creation of a larger and deeper pool of data engineering skills available to the organization. We have observed the same cross-skill pollination with the DevOps movement, and the birth of new types of engineers such as SREs.
Data must be treated as a foundational piece of any software ecosystem, hence software engineers and software generalists must add the experience and knowledge of data product development to their tool belt. Similarly, infrastructure engineers need to add knowledge and experience of managing data infrastructure. Organizations must provide career development pathways from a generalist to a data engineer. The lack of data engineering skills has led to the local optimization of forming centralized data engineering teams, as described in the section Siloed and hyper-specialized ownership.