Thank you for reading my latest article.
Here at LinkedIn, I regularly write about data architecture, business architecture, business concepts and technology trends. To read my future articles, simply join my Newsletter on LinkedIn or click 'Follow'. Also, feel free to connect with me on LinkedIn.
Data Quality is often a topic of interest and debate. There are multiple schools of thought around how, when, where and with what Data Quality design and implementation should be done. There are many open-source and licensed platforms and frameworks out there that will help you design your Data Quality solution. But it is important to know from the outset what your Data Quality requirements are, what capabilities you need to build and what your design principles are. Please keep in mind that standing up a Data Quality solution can be a costly undertaking for organizations that already have a lot of production workloads and a heavy data footprint. Whatever approach you choose should be holistic, scalable and flexible for future expansion.
In this article, I want to talk about some of those key points, starting with the design principles:
- Support for multiple Lines of Business: You may start the initial design and implementation with a specific Line of Business, but the framework should be easily adoptable by other Lines of Business within the enterprise.
- Support for Multiple Applications, Systems of Record and Data Domains: Each application, SOR and Data Domain may have different needs for Data Quality, SLAs, reporting and notifications. The design should accommodate these variations without per-application rework.
- Support for Hybrid Architectures: The DQ Framework should be agnostic to whether the data design is cloud-native, on-prem or hybrid.
- Support for Rule Mastering: Rule authoring, review and approval should involve both Technology and Data Management teams.
- Support for Configuration-Driven Development: The core Rule Engine should be generic, and new DQ Rule development and onboarding should be configuration driven (a sample rule configuration is sketched after this list).
- Support for Notifications: Support multiple notification mechanisms for Data Management, Technology, Application and Business stakeholders.
- Support for Incident Management: Any remediation workflow should integrate with the enterprise Incident Management platform.
- Support for Automation, Orchestration and Scheduling: All DQ processes and executions should be automated.
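To make the configuration-driven principle concrete, here is a minimal sketch of what a single DQ rule configuration might look like in JSON. Every field name here (rule_id, rule_type, threshold_pct and so on) is an illustrative assumption, not a prescribed schema:

```json
{
  "rule_id": "CUST_EMAIL_NOT_NULL",
  "data_domain": "Customer",
  "data_source": "crm_db",
  "dataset": "customers",
  "rule_type": "completeness",
  "target_column": "email",
  "severity": "high",
  "threshold_pct": 99.5,
  "notify": ["dq-team@example.com"],
  "schedule": "daily"
}
```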
Now let's talk about each of these points:
- Data Quality Rule Mastering: The design should support DQ configurations per Data Domain, Data Source and Dataset. It should support simple, technology-friendly configuration formats such as JSON or XML. Alternatively, Data Management teams and Data Stewards should have an easy-to-use Web UI to enter, review and authorize/approve DQ configurations; this abstracts them from the underlying configuration format. Behind the scenes, data entered through this Web application should still be converted to the standard configuration format. These configurations should trigger automated CI/CD pipelines that load the translated DQ Rules into the DQ Rules Repository.
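As a sketch of that CI/CD loading step, the snippet below validates a rule configuration file and upserts it into a rules repository. SQLite stands in for whatever repository you actually choose, and the required-field list and dq_rules table are assumptions for illustration (the table itself is defined in the repository sketch under the next bullet):

```python
import json
import sqlite3  # SQLite as a stand-in for the enterprise DQ Rules Repository

REQUIRED_FIELDS = {"rule_id", "data_domain", "dataset", "rule_type", "target_column"}

def load_rule(config_path: str, conn: sqlite3.Connection) -> None:
    """Validate one DQ rule configuration and upsert it into the repository.
    This is the kind of step a CI/CD pipeline would run on each config change."""
    with open(config_path) as f:
        rule = json.load(f)
    missing = REQUIRED_FIELDS - rule.keys()
    if missing:
        raise ValueError(f"Invalid DQ rule config, missing fields: {sorted(missing)}")
    conn.execute(
        "INSERT OR REPLACE INTO dq_rules (rule_id, definition) VALUES (?, ?)",
        (rule["rule_id"], json.dumps(rule)),
    )
    conn.commit()
```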
- DQ Rules Repository: The design should support cloud-native, on-prem or hybrid solutions. It should be compatible with all three major cloud providers: AWS, GCP and MS Azure. You can also choose cloud-native data platforms like Snowflake and Databricks. Essentially, the design should support NoSQL, relational, Lakehouse and Data Lake design patterns. A few choices you may have are SQL Server, Oracle, AWS RDS, AWS Athena, AWS Redshift and so on.
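One possible relational layout for the repositories discussed in this article is sketched below, with SQLite purely as a stand-in; the same three tables (rules, results, exceptions) map onto RDS, Redshift, Snowflake or any other platform named above. Table and column names are illustrative assumptions:

```python
import sqlite3

# SQLite used purely as a stand-in for the enterprise repository choice.
conn = sqlite3.connect("dq_repository.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS dq_rules (
    rule_id     TEXT PRIMARY KEY,
    definition  TEXT NOT NULL,          -- full JSON rule configuration
    updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS dq_results (
    run_id      TEXT,
    rule_id     TEXT REFERENCES dq_rules(rule_id),
    status      TEXT,                   -- PASS / FAIL
    pass_pct    REAL,
    executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS dq_exceptions (
    run_id      TEXT,
    rule_id     TEXT,
    record_key  TEXT,                   -- key of the offending record
    detail      TEXT
);
""")
```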
- Rule Engine: The Rule Engine should be at the core of your DQ design. Think of it as an application or service that accepts DQ Rules from the repository along with input data (either during the data pipeline or after the fact, once data is loaded into the desired destination) and generates output and DQ Exceptions. The Rule Engine will have multiple downstream integrations for Incident Management, DQ notifications, DQ Portals/Dashboards, downstream workflows and Business Intelligence/Analytics on DQ metrics. A minimal sketch follows.
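This sketch handles only a single completeness-style check; a real engine would dispatch on rule_type and support many more rule families. The rule fields match the illustrative configuration shown earlier:

```python
from typing import Iterable

def run_rule(rule: dict, rows: Iterable[dict]) -> tuple[dict, list[dict]]:
    """Apply one completeness rule to input rows and return
    (summary result, exception records)."""
    column = rule["target_column"]
    exceptions = []
    total = 0
    for row in rows:
        total += 1
        if row.get(column) in (None, ""):
            # Each failing record becomes a DQ Exception for the repository.
            exceptions.append({"rule_id": rule["rule_id"], "record": row})
    pass_pct = 100.0 * (total - len(exceptions)) / total if total else 100.0
    status = "PASS" if pass_pct >= rule.get("threshold_pct", 100.0) else "FAIL"
    return {"rule_id": rule["rule_id"], "status": status, "pass_pct": pass_pct}, exceptions
```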
- DQ Exception Repository: DQ Exceptions produced as part of DQ execution results can be stored in their own repository, separated physically or logically (as a schema) from the DQ Rules Repository.
- Incident Management: Many organizations have a preferred Incident Management platform to create, track and manage business data incidents at various levels of severity, with SLA expectations for closing those incidents managed accordingly. DQ Rule Exceptions generated by the Rule Engine should leverage this enterprise Incident Management platform rather than a newly created solution. To achieve that, you need a solution for continuous integration and detection of logs, events, metrics and DQ results; there are many products in the market with this capability, such as Sumo Logic, Datadog, AppDynamics and so on. The enterprise incident platform will ensure that incidents are created in the correct queue and assigned to the correct teams for SLA-based resolution and notification.
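As a sketch of handing a failed rule off to an incident platform, the snippet below posts to a generic REST endpoint. The URL, payload fields and queue name are all placeholders, not any real product's API; substitute your organization's actual incident platform endpoint and its authentication:

```python
import requests

INCIDENT_API = "https://incidents.example.com/api/v1/incidents"  # hypothetical endpoint

def raise_incident(result: dict, exceptions: list) -> None:
    """Open an incident in the enterprise platform when a DQ rule fails."""
    if result["status"] != "FAIL":
        return
    payload = {  # illustrative fields; map to your platform's schema
        "title": f"DQ rule {result['rule_id']} failed",
        "severity": "high",
        "exception_count": len(exceptions),
        "queue": "data-quality",
    }
    resp = requests.post(INCIDENT_API, json=payload, timeout=10)
    resp.raise_for_status()
```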
- Notifications: Incidents are created in case of DQ Exceptions, but under normal circumstances you still need to send notifications about DQ executions and their results. The design should support common notification mechanisms such as email, collaboration-channel notifications (MS Teams, Slack, etc.) and in-application alerts. Notifications do not necessarily have to provide all the details; rather, they should point recipients to where execution details and results are stored. A notification may provide just a high-level summary and status.
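For example, a summary-only notification to a Slack channel via an incoming webhook might look like the following; the webhook URL and portal link are placeholders:

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # your webhook URL

def notify(result: dict, portal_url: str) -> None:
    """Send a high-level summary to a collaboration channel; the message
    links to the DQ Portal rather than embedding full results."""
    text = (
        f"DQ rule {result['rule_id']}: {result['status']} "
        f"({result['pass_pct']:.1f}% passing). Details: {portal_url}"
    )
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)
```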
- DQ Portal: A dedicated DQ Portal/Dashboard should be built to serve DQ execution results, their history and result attribution. Visibility should be based on level of access. DQ results should be served through DQ APIs that the Portal consumes.
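A minimal version of such an API, sketched here with FastAPI against the illustrative SQLite repository from earlier, might expose recent results per rule; authentication and access control are omitted for brevity but would gate visibility in practice:

```python
import sqlite3

from fastapi import FastAPI

app = FastAPI(title="DQ API")  # illustrative service behind the DQ Portal

@app.get("/dq/results/{rule_id}")
def get_results(rule_id: str, limit: int = 20):
    """Return recent execution results for one rule; the Portal consumes this."""
    conn = sqlite3.connect("dq_repository.db")
    rows = conn.execute(
        "SELECT run_id, status, pass_pct, executed_at FROM dq_results "
        "WHERE rule_id = ? ORDER BY executed_at DESC LIMIT ?",
        (rule_id, limit),
    ).fetchall()
    return [
        {"run_id": r[0], "status": r[1], "pass_pct": r[2], "executed_at": r[3]}
        for r in rows
    ]
```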
- DQ Results Repository: Just like the DQ Rules and DQ Exception repositories, you will need a DQ Results Repository to store execution details and results for DQ Rules, including long-term history of DQ results. Like the other repositories, it can be physically or logically separate. Visibility should be based on access levels.
- Business Workflow: As the DQ Framework is part of a larger data ecosystem, it is key to have automated orchestration that, after each DQ Engine execution, triggers either the happy-path workflow or the remediation/exception workflow.
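One way to wire this up is a branching gate in an orchestrator; the sketch below assumes Apache Airflow 2.x, and the DQ check task is a placeholder that would actually invoke the Rule Engine:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator

def run_dq_checks():
    # Placeholder: in reality this would invoke the Rule Engine and
    # return its summary result (pushed to XCom automatically).
    return {"rule_id": "CUST_EMAIL_NOT_NULL", "status": "PASS"}

def choose_path(ti):
    # Branch on the DQ summary published by the upstream check task.
    result = ti.xcom_pull(task_ids="run_dq_checks")
    return "happy_path" if result and result["status"] == "PASS" else "remediation"

with DAG("dq_gate", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    checks = PythonOperator(task_id="run_dq_checks", python_callable=run_dq_checks)
    branch = BranchPythonOperator(task_id="branch_on_dq", python_callable=choose_path)
    happy_path = EmptyOperator(task_id="happy_path")
    remediation = EmptyOperator(task_id="remediation")
    checks >> branch >> [happy_path, remediation]
```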
- Business Intelligence: The Data Management team should be provided with Business Intelligence, analytics, visualization and dashboarding capabilities over DQ metrics and their long-term history. This will help them understand DQ patterns and take the continuous steps needed for Data Quality Governance.
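As a small illustration of the kind of metric such a dashboard would surface, this pandas snippet computes a weekly pass-rate trend per rule from the illustrative results repository used earlier:

```python
import sqlite3

import pandas as pd

# Pull the long-term result history and compute a weekly pass-rate trend
# per rule, the kind of metric a DQ dashboard or BI tool would plot.
conn = sqlite3.connect("dq_repository.db")
results = pd.read_sql_query(
    "SELECT rule_id, pass_pct, executed_at FROM dq_results", conn
)
results["executed_at"] = pd.to_datetime(results["executed_at"])
trend = (
    results.set_index("executed_at")
    .groupby("rule_id")["pass_pct"]
    .resample("W")
    .mean()
    .reset_index()
)
print(trend.head())
```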
To stay up to date with my latest articles, make sure to subscribe to my newsletter and follow me on LinkedIn. And if you're interested in taking a deeper dive into some of these topics, please feel free to reach out to me.
#dataquality #dataqualitygovernance #hybridarchitecture #configurationdrivendq