Data Mesh, Data Fabric and Data Lake Architecture - driving data products innovation in clinical research
Executive Summary:
In this world of big data and the digital economy, every company is a data company. Hence, every company is trying to become data-driven by unleashing the power of data through data transformation initiatives - building data products, services and applications in order to gain insights, make predictions, generate revenue, optimize costs, mitigate risk, improve compliance, and enhance decision-making.
The lifecycle of a “data product” mirrors standard product development: you identify an opportunity to solve a core user need, you build an initial version, and then you evaluate its impact and improve it through iteration. But the data component adds an extra layer of complexity. To tackle the challenge, companies must emphasize cross-functional collaboration, evaluate and prioritize data product opportunities with an eye to the long term, and start simple.
Building data products is a team sport. Every company realizes that data is a strategic differentiator. Data is the new oil. Whoever gets data right creates a competitive advantage and stands out in the marketplace.
In this data-driven digital world, data is king.
Getting data right, however, is no small task. As you will find out in this article, it's a journey with ups and downs, ebbs and flows, just like the ones a team experiences while playing a sport.
Had getting data right been an easy undertaking, every company would have succeeded at it, particularly at building smart data products. There are good reasons why it is hard:
Data integration is complex because data sources are increasingly complex.
Data variety is too high - some sources are structured and clean; some are unstructured and require NLP techniques to mine information from them; some data is binary, such as images and scans, and requires sophisticated AI algorithms, including neural networks, to process.
Data volume is too large - this is forcing all of us to think differently, architecturally speaking, about how to store it, where to store it and how to process it.
Data velocity is too high - the use of mobile devices, wearables and other biosensors to gather and store huge amounts of health-related data has been rapidly accelerating.
Data veracity is often lacking - the data quality of the sources is oftentimes inadequate, to say the least.
Data sources are siloed, as they were built standalone with no interoperability in mind. Many legacy applications do not even support APIs for data exchange. Since interoperability is virtually non-existent, there are no standard data semantics for data exchange. Matching and merging are the primary ways to connect disparate data sources because, in most situations, there is no common identifier.
Data privacy regulations are relentless. When we add data privacy laws and regulations around PHI and PII (GDPR, CCPA, HIPAA, 21 CFR Part 11, etc.), the challenges of data integration go to a different level. The architecture must support data privacy, security, auditability, provenance, masking and anonymization.
Therefore, a series of steps - some process-related, some data governance related, some data privacy and provenance related, and some technology and architecture related - must work hand-in-hand to get data right before value from data can be realized at scale and put to use through smart data products, data services and applications.
Key data enablers to get data right - a practitioner's perspective from the trenches of innovation
While the challenges seem insurmountable, there are various enablers that, when executed right, can lead to successful execution of an enterprise data transformation initiative, enabling the final goal of getting data right for innovation.
Let's discuss those enablers.
Infrastructure and pipeline - Just as crude oil increases in value through continuous refinement, the same is true of data. In order to extract value from data, we need scalable data storage, data refinery, data pipelines and data delivery, all while maintaining data quality and data governance. Additionally, we need to secure the data field just as we secure an oil field, with proper security access control and oversight.
Technology, architecture and others - Modern technology and architecture are just one of many such enablers. Others are business process modernization, data culture, stakeholder alignment, data ownership, data accountability, data governance and data stewardship. Equally important is an actionable enterprise data strategy that is in full alignment with business strategy, data product strategy and corporate goals.
Data maturity is also a key enabler to get data right. Data maturity includes, among other things, data understanding, stakeholder alignment behind the data transformation journey, data stewardship, data ownership, understanding data rights from compliance and regulatory perspectives before building data products and services for data monetization, data engineering capabilities, modern data architecture skills, AI/ML skills and cloud skills, just to name a few.
To summarize the data integration problem, we need both business goals and data maturity as part of our comprehensive enterprise data strategy, as depicted below. This approach will help us get data right.
Our Holistic approach to Data Transformation - enabling data products innovation in clinical research and precision medicine:
At ConcertAI, we are focusing relentlessly on building scalable data products, data services and data applications that are revolutionizing oncology clinical research and precision medicine.
All of our data products are built on a modern data platform in the cloud, using modern data architecture techniques and software development best practices, with scalability, performance, usability, data security, data provenance, data auditability and data privacy in mind.
We follow a methodical approach. In our clinical research domain, we deal with patient information; hence our data products are based on highly curated and integrated data of the highest quality.
We enable data in a multitude of different ways. The following picture depicts a conceptual view of it. This is for illustrative purposes only.
Our data products integrate disparate data sources, collectively called real world data (RWD) sources.
RWD is data relating to patient health status and/or the delivery of routine health care, collected from a variety of sources. Real-World Evidence (RWE) is the clinical evidence about the usage and potential benefits or risks of a medical product derived from analysis of RWD.
RWD sources include electronic health records (EHRs), medical claims and billing data, data from product and disease registries, patient-generated data, including from in-home-use settings (e-diaries, smart glucose monitors, smart blood pressure monitors, etc.), and data gathered from other sources that can inform on health status, such as mobile devices, applications and social media.
RWD sources can be used for data collection and, in certain cases, to develop analysis infrastructure to support many types of study designs for developing RWE, including, but not limited to, randomized trials (e.g., large simple trials, pragmatic clinical trials) and observational studies (prospective or retrospective).
We utilize these RWD sources to build data products, data services and data applications for pharmaceutical companies that cut the costs of patient screening, patient recruitment and retention, and site selection in oncology, which in turn helps pharmaceutical companies bring life-saving drugs to patients faster and cheaper.
Our modern data platform is enabling us to innovate faster.
Below is a typical conceptual data flow within our modern data platform, as data flows from the sources (data producers) through the modern data platform all the way to various data consumers.
The data is governed end-to-end, curated, and checked for completeness and quality.
Challenges and Opportunities of RWD Integration and the Need for a Modern Data Platform:
The primary challenge in working with EHR-derived RWD is that critical patient information is often buried in unstructured documents such as clinical notes and pathology reports, making it difficult to extract and analyze key outcomes of interest. Inclusion and exclusion criteria, such as those based on histology, gene alterations and metastatic status, are reliably found only in those unstructured documents.
Often, an effective approach to leveraging this unstructured data for cohort selection is to pair AI-driven technology, primarily Natural Language Processing (NLP), with human review.
NLP is able to read and comprehend doctors’ notes, pathology reports and other text from EHR systems, often in combination with optical character recognition technologies.
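To make the idea concrete, below is a minimal, illustrative sketch (in Python) of rule-based extraction of a few inclusion/exclusion-relevant concepts from free-text notes, with every hit flagged for human review. The concept names, patterns and sample note are hypothetical assumptions; real pipelines typically rely on trained clinical NLP models and expert abstraction rather than simple rules like these.

```python
# Minimal illustrative sketch (not a production NLP system): extract a few
# inclusion/exclusion-relevant concepts from free-text notes with simple
# rules, and flag each hit for human review.
import re
from dataclasses import dataclass

@dataclass
class Finding:
    concept: str        # e.g. "metastatic_status"
    evidence: str       # the sentence the concept was found in
    needs_review: bool  # always True here: AI output is paired with human review

PATTERNS = {
    "metastatic_status": re.compile(r"\bmetasta(tic|sis|ses)\b", re.I),
    "histology_adenocarcinoma": re.compile(r"\badenocarcinoma\b", re.I),
    "egfr_alteration": re.compile(r"\bEGFR\b.{0,40}\b(mutation|exon\s*19|L858R)\b", re.I),
}

def extract_findings(note_text: str) -> list[Finding]:
    findings = []
    # Naive sentence split; clinical notes usually need a proper sentencizer.
    for sentence in re.split(r"(?<=[.!?])\s+", note_text):
        for concept, pattern in PATTERNS.items():
            if pattern.search(sentence):
                findings.append(Finding(concept, sentence.strip(), needs_review=True))
    return findings

if __name__ == "__main__":
    note = ("Pathology confirms lung adenocarcinoma. "
            "EGFR exon 19 deletion detected. No evidence of metastatic disease.")
    for f in extract_findings(note):
        print(f.concept, "->", f.evidence)
```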
AI models also read radiological scans and other images with very high accuracy and speed. Our collective dataset is of high volume, variety and veracity. Data types are widely varied - structured, unstructured, semi-structured and binary (images, CT scans, etc.).
Other data integration challenges include data interoperability, data quality, inconsistent metadata across systems, vendor lock-in, and the lack of data standards for data exchange such as APIs. Such variety and veracity within the RWD landscape pose additional data integration challenges from an architecture and technology perspective.
Our modern, cloud-based, scalable data platform, founded on core data architecture principles, performs superior data ingestion, curation and integration at scale. Our high-quality data, along with AI/ML, forms the basis of all of our data products and data applications.
Against this backdrop, we discuss three modern and emerging data architecture design patterns - Data Lake, Data Mesh and Data Fabric - for scale, performance, automation and operational excellence.
Data Lake Architecture Overview:
A Data Lake is a storage repository that can store large amounts of structured, semi-structured and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It can hold very large quantities of data to support analytic performance and native integration.
A Data Lake is like a large container, very similar to a real lake fed by rivers. Just as a lake has multiple tributaries coming in, a data lake has structured data, unstructured data, binary data, etc., flowing in - in real time, in batches or in micro-batches. The trick is to keep the data lake clean so it does not become a data swamp over time.
This is where data governance, data quality management, data ownership and data stewardship come in.
Data culture plays a big role too.
The hardest thing is executing data governance and data security around the data inside the data lake over a sustained period of time. Data catalog and metadata management play a pivotal role in making this happen, as they enable data discovery for data consumers. Data authorization based on roles and data policies is also critical.
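To make the lake-versus-swamp point concrete, here is a minimal sketch of a zoned layout - a raw zone that keeps data in its native format and a curated zone that only accepts quality-checked data. The folder paths, file formats and quality rule are illustrative assumptions (local folders and Parquet stand in for cloud object storage), not a description of any specific platform.

```python
# Minimal sketch of a zoned data lake layout (raw -> curated).
# Assumes pandas plus a Parquet engine (e.g. pyarrow) are installed.
from pathlib import Path
import pandas as pd

RAW_ZONE = Path("datalake/raw/ehr_extracts")      # data kept in its native format
CURATED_ZONE = Path("datalake/curated/patients")  # cleaned, conformed data

def land_raw(source_name: str, payload: bytes, filename: str) -> Path:
    """Store the source file as-is: the lake keeps data in its native format."""
    target = RAW_ZONE / source_name / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target

def curate_patients(raw_csv: Path) -> Path:
    """Promote raw data to the curated zone after basic quality checks."""
    df = pd.read_csv(raw_csv)
    df = df.dropna(subset=["patient_id"])           # simple illustrative quality rule
    df["patient_id"] = df["patient_id"].astype(str).str.strip()
    CURATED_ZONE.mkdir(parents=True, exist_ok=True)
    out = CURATED_ZONE / (raw_csv.stem + ".parquet")
    df.to_parquet(out, index=False)                 # columnar format for analytics
    return out
```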
Data Mesh Architecture Overview
Just as software engineering teams have transitioned from building monolithic applications to microservice-based architectures to keep each component loosely coupled and independent, the data mesh architecture pattern provides the same level of benefits - it is modular, domain-driven and distributed, so each component can evolve independently. The data mesh platform architecture embraces the ubiquity of data by leveraging a domain-oriented and self-serve design. The connective tissue between these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards, driven by master metadata management and master data management.
Functioning similarly to a service mesh, a data mesh creates a layer of connectivity that abstracts away the complexities of connecting, managing and supporting access to data. At its core, it is used to stitch together data held across multiple data silos. The premise of a data mesh is that it is used to connect distributed data across different locations and organizations.
A data mesh ensures that data is highly available, easily discoverable, secure, and interoperable with the applications that need access to it.
Data Mesh architecture embraces a "data-as-a-product" mindset, which ties back to all the enablers of success we discussed at the beginning.
At a high level, a data mesh is composed of three separate components: data sources, data infrastructure, and domain-oriented data pipelines managed by functional owners. Underlying the data mesh architecture is a layer of universal interoperability, reflecting domain-agnostic standards, as well as observability, auditability and data governance.
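As an illustration of the "data-as-a-product" idea, the sketch below shows what a domain-owned data product contract might look like - the kind of descriptor a self-serve platform could register in its catalog. The field names, the example product and its values are assumptions for illustration only.

```python
# Minimal sketch of a domain-owned data product contract in a data mesh.
# Field names and values are illustrative; real contracts would live in a
# catalog and be enforced by the self-serve platform.
from dataclasses import dataclass, field

@dataclass
class DataProductContract:
    name: str                 # globally discoverable product name
    domain: str               # owning business domain
    owner: str                # accountable data product owner
    output_port: str          # how consumers access it (table, API, topic)
    schema: dict              # column name -> type: the interoperability layer
    freshness_sla_hours: int  # how fresh the product is guaranteed to be
    quality_checks: list = field(default_factory=list)

patient_cohort_product = DataProductContract(
    name="oncology.patient_cohort",
    domain="clinical_data",
    owner="clinical-data-team@example.com",
    output_port="warehouse.table:analytics.patient_cohort",
    schema={"patient_id": "string", "diagnosis_code": "string", "metastatic": "bool"},
    freshness_sla_hours=24,
    quality_checks=["patient_id is unique", "diagnosis_code is valid ICD-10"],
)
```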
Data Fabric Architecture Overview
This is another emerging data architecture pattern that encourages a single unified data architecture with an integrated set of technologies and services, designed specifically to deliver integrated and enriched data - at the right time, in the right way and to the right data consumer - in support of both operational and analytical workloads. This architecture pattern combines key data management technologies such as data catalog, data governance, data integration, data pipelining and data orchestration.
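A minimal sketch of the metadata-driven idea behind a data fabric follows: a consumer asks for a logical dataset, the fabric consults a catalog and routes the request to the appropriate connector, hiding where and how the data is physically stored. The catalog entries and connector stubs below are hypothetical.

```python
# Minimal sketch of metadata-driven access in a data fabric: consumers ask
# for a logical dataset name and the fabric resolves where and how it is
# stored. Catalog entries and connectors are hypothetical placeholders.
CATALOG = {
    "patients":  {"system": "warehouse", "location": "analytics.patients"},
    "claims":    {"system": "lake",      "location": "datalake/curated/claims"},
    "ehr_notes": {"system": "api",       "location": "https://ehr.example.com/notes"},
}

def read_from_warehouse(location: str) -> str:
    return f"SELECT * FROM {location}"       # placeholder for a SQL client call

def read_from_lake(location: str) -> str:
    return f"read parquet at {location}"     # placeholder for an object-store read

def read_from_api(location: str) -> str:
    return f"GET {location}"                 # placeholder for an HTTP call

CONNECTORS = {"warehouse": read_from_warehouse, "lake": read_from_lake, "api": read_from_api}

def get_dataset(logical_name: str) -> str:
    """Resolve a logical dataset via the catalog, hiding physical storage details."""
    entry = CATALOG[logical_name]
    return CONNECTORS[entry["system"]](entry["location"])

print(get_dataset("claims"))  # consumer never needs to know it lives in the lake
```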
At ConcertAI, we employ both data mesh and data fabric design patterns to complement our existing data warehousing and data lake patterns. For us, it's not an either/or debate; it is about employing the right architecture and toolsets to solve the right use cases, with data findability, accessibility, interoperability and reusability in mind.
Data Architecture and Quality attributes of a Modern Data Platform
A modern data platform must have the following quality attributes, data components and data services:
Data catalog, master data management and metadata management - This platform service classifies and inventories data assets and represents information supply chains visually. Metadata management is also a piece of this - standard data definitions, standard semantics and data lineage rules, among other things, fall under this big umbrella. Master Data Management concerns the master entities that run the business (customer, patient, provider, contract, to name a few); metadata management is equally important, particularly when data is coming from a myriad of sources.
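Because disparate sources often lack a common identifier (as noted earlier), mastering patient or customer entities usually involves matching and merging records. The sketch below shows one minimal approach - deterministic matching on normalized fields with a fuzzy fallback - using illustrative fields and an assumed similarity threshold; production MDM tools use far more sophisticated probabilistic matching.

```python
# Minimal sketch of record matching for master data management when sources
# share no common identifier: exact match on normalized fields first, then a
# simple fuzzy fallback. Fields and the threshold are illustrative.
from difflib import SequenceMatcher

def normalize(record: dict) -> tuple:
    return (record["last_name"].strip().lower(),
            record["first_name"].strip().lower(),
            record["dob"])

def similarity(a: dict, b: dict) -> float:
    name_a = f"{a['first_name']} {a['last_name']}".lower()
    name_b = f"{b['first_name']} {b['last_name']}".lower()
    return SequenceMatcher(None, name_a, name_b).ratio()

def match(ehr_record: dict, claims_record: dict, threshold: float = 0.85) -> bool:
    """Deterministic match on (name, DOB), else fuzzy name match with the same DOB."""
    if normalize(ehr_record) == normalize(claims_record):
        return True
    return (ehr_record["dob"] == claims_record["dob"]
            and similarity(ehr_record, claims_record) >= threshold)

ehr = {"first_name": "Jon", "last_name": "Smith", "dob": "1960-04-12"}
clm = {"first_name": "John", "last_name": "Smith", "dob": "1960-04-12"}
print(match(ehr, clm))  # True: fuzzy name match with the same date of birth
```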
Scalable Data Pipeline - This platform service, which is usually a collection of well-designed, scalable data pipeline services along with orchestration and workflow, builds reliable and robust data pipelines for both operational and analytical use cases.
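A minimal sketch of such a pipeline, assuming a simple CSV-to-Parquet flow with a quality gate, is shown below. The step names, the quality rule and the pandas/pyarrow stack are illustrative; a production pipeline would add retries, logging, schema enforcement and orchestration.

```python
# Minimal sketch of a pipeline as a sequence of small, composable steps with a
# quality gate. Assumes pandas plus a Parquet engine (e.g. pyarrow).
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def validate(df: pd.DataFrame) -> pd.DataFrame:
    if df["patient_id"].isna().any():
        raise ValueError("quality gate failed: missing patient_id")
    return df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df["diagnosis_code"] = df["diagnosis_code"].str.upper().str.strip()
    return df

def load(df: pd.DataFrame, target: str) -> None:
    df.to_parquet(target, index=False)

def run_pipeline(source: str, target: str) -> None:
    # Each step is a pure function, so steps can be tested and reused independently.
    load(transform(validate(ingest(source))), target)
```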
AI/ML/Data Science Sandbox - The platform must support data scientists by providing high-quality, integrated datasets for AI/ML models. Roughly 80% of the time spent in AI/ML modeling goes to data preparation, including data wrangling and integration. If the platform can streamline data preparation, the data science team can focus solely on modeling tasks.
Self-service Business Intelligence - The platform must have self-service capability. Data can be served up to consumers in many ways, but the platform needs to publish the metadata of the data lake, for example, so consumers know what data exists, how that data is processed, what business rules are applied along the way, who owns the data, and so on. This metadata-driven platform enables self-service and enterprise adoption of BI and AI.
DevOps, SecOps and MLOps - The platform must have an automated CI/CD (continuous integration and continuous delivery) pipeline. Security should be a part of all development. Once AI/ML models are created by data scientists, the model parameters must be integrated into the data pipeline for production usage through modern cloud-based MLOps techniques for efficiency, monitoring and continuous retraining of models. The learnings of the models must be fed back into the data for continuous improvement. Data hops must be minimized wherever possible: each time data hops from one environment or data store to another, data quality issues are introduced, resulting in more QC time and compromising data delivery. The data pipeline must be automated end-to-end through automated testing, workflow management, notification, orchestration, etc.
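As a rough illustration of folding a data scientist's model into the production pipeline, the sketch below loads a versioned model from a registry path, scores a batch, records the model version for lineage and computes a crude drift signal. The registry layout, feature names and the assumption that the model exposes a scikit-learn-style predict_proba are all illustrative, not a specific MLOps product.

```python
# Minimal sketch of wiring a versioned, pre-trained model into a batch scoring
# step of the data pipeline. Registry layout, features and drift check are
# illustrative assumptions.
import pickle
from pathlib import Path
import pandas as pd

MODEL_REGISTRY = Path("models/cohort_classifier")  # e.g. models/cohort_classifier/v3/model.pkl

def load_model(version: str):
    with open(MODEL_REGISTRY / version / "model.pkl", "rb") as f:
        return pickle.load(f)

def score_batch(df: pd.DataFrame, version: str = "v3") -> pd.DataFrame:
    model = load_model(version)                     # assumed to expose predict_proba
    features = df[["age", "stage", "biomarker_positive"]]
    scored = df.copy()
    scored["eligibility_score"] = model.predict_proba(features)[:, 1]
    scored["model_version"] = version               # lineage for auditability
    return scored

def needs_retraining(scored: pd.DataFrame, expected_mean: float = 0.30) -> bool:
    """Crude drift signal: alert if the score distribution shifts noticeably."""
    return abs(scored["eligibility_score"].mean() - expected_mean) > 0.15
```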
Data Security by Design - These services support data security. Data must be encrypted and masked to meet data privacy regulations - encrypted at rest and in motion. Data security by design is the best way to achieve data security, so that it does not become an afterthought; it is a part of software development.
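A minimal sketch of pseudonymization and masking before data leaves a secure zone is shown below: direct identifiers are replaced with a keyed hash and quasi-identifiers are coarsened. The salt handling and masking rules are illustrative assumptions, not a compliance recipe for HIPAA or GDPR.

```python
# Minimal sketch of pseudonymization and masking: hash direct identifiers with
# a secret salt and coarsen quasi-identifiers. Illustrative only.
import hashlib
import hmac
import os
import pandas as pd

SECRET_SALT = os.environ.get("PSEUDONYM_SALT", "change-me").encode()  # keep out of code in practice

def pseudonymize(value: str) -> str:
    """Keyed hash: the same patient maps to the same token, without being reversible."""
    return hmac.new(SECRET_SALT, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_patients(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["patient_token"] = out["patient_id"].map(pseudonymize)
    out = out.drop(columns=["patient_id", "name"])           # drop direct identifiers
    out["birth_year"] = pd.to_datetime(out["dob"]).dt.year   # coarsen quasi-identifier
    out = out.drop(columns=["dob"])
    return out
```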
Data governance by Design - The platform must assure quality, comply with privacy regulations, and make data available – safely and at scale. Data architecture review board plays a big role here. Data Governance is best achieved by forming a cross-functional team with data owners and data stewards involved across business lines including SMEs and domain experts. This is critical to maintaining quality of data over time within the platform, aligned with data privacy laws.
Data orchestration by Design - These platform services define the data flows from source to target, including the sequence of steps for data cleansing, transformation, masking, enrichment and validation.
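A minimal sketch of declarative orchestration follows: the steps named above (cleansing, transformation, masking, enrichment, validation) are declared as a small dependency graph and executed in topological order. In practice a workflow engine such as Airflow or Dagster provides scheduling, retries and alerting; the DAG below is purely illustrative.

```python
# Minimal sketch of declarative orchestration: steps and their dependencies are
# declared as a small DAG and executed in topological order.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

STEPS = {
    "cleanse":   [],
    "transform": ["cleanse"],
    "mask":      ["transform"],
    "enrich":    ["transform"],
    "validate":  ["mask", "enrich"],
}

def run(step: str) -> None:
    print(f"running {step}")            # placeholder for the real step logic

def orchestrate(dag: dict) -> None:
    for step in TopologicalSorter(dag).static_order():
        run(step)

orchestrate(STEPS)  # cleanse -> transform -> mask/enrich -> validate
```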
Scale and performance by Design - These services scale both up and down dynamically and seamlessly, no matter how large the data volume or its processing demands, and support both operational and analytical workloads at enterprise scale.
Data Accessibility by Design - These services support all data access modes, data sources and data types, and integrate master and transactional data, at rest and in motion. They ingest and unify data from on-premise and cloud systems, in any format - structured or unstructured. The platform's logical access layer needs to allow for data consumption regardless of where or how the data is stored or distributed, so that no in-depth knowledge of the underlying data sources is necessary.
Data Delivery at scale by Design - This also ties back to self-service. Data within the platform must be findable, accessible, interoperable and reusable. This platform layer should support APIs for both internal and external users. Depending on the use cases served, the delivery channel may create value-added data marts, for example, for data science teams. This layer may also contain embedded analytics or enable enterprise BI. All delivery channels must be monitored for scale, performance, security and consumption. These platform services retrieve data from any source and deliver it to any target, through any method: ETL (bulk), messaging, CDC, virtualization and APIs.
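To illustrate an API-based delivery channel, here is a minimal sketch of a read-only endpoint over a hypothetical curated data mart, using FastAPI as one possible framework. The endpoint path, mart location and response shape are assumptions; a real service would add authentication, authorization, paging, rate limiting and consumption auditing.

```python
# Minimal sketch of a delivery API over a curated data mart. The mart path and
# endpoint design are illustrative assumptions.
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Cohort Delivery API")
MART_PATH = "datalake/curated/patient_cohort.parquet"  # hypothetical data mart

@app.get("/cohorts/{diagnosis_code}")
def get_cohort(diagnosis_code: str, limit: int = 100):
    df = pd.read_parquet(MART_PATH)
    subset = df[df["diagnosis_code"] == diagnosis_code.upper()].head(limit)
    if subset.empty:
        raise HTTPException(status_code=404, detail="no patients for this diagnosis")
    return subset.to_dict(orient="records")  # JSON-friendly payload

# Run locally (assuming this file is delivery_api.py) with:
#   uvicorn delivery_api:app --reload
```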
Conclusion:
Understanding and acknowledging the importance of data is a vital step for a company, but what happens next can be a challenge: realizing data's value by converting data into data products, data services and data applications for monetization.
We discussed several "data enablers" and "data integration challenges" that need to be addressed for an enterprise data transformation initiative to be successful.
From a practitioner's perspective, the data transformation journey may look complex and, at times, insurmountable. Our experience tells us, however, that when the best practices and tried-and-true approaches laid out in this article are followed while architecting and building data platforms, data products and data applications, chances are that your data transformation journey will be smooth and successful.
Remember, data products are like wine - they get better with age, so start small and iterate with continuous improvement and quality in mind.
Take the first step towards your data transformation journey - like any journey, it starts with taking the first step. You can do it.