"From Source to Insight: Why Data Quality Remains Elusive" – Part 1
I have been thinking more deeply about data quality after a series of frequent, challenging encounters with poor data quality in practice. These issues often lead to user frustration and can overshadow the valuable work of data engineers.
As a Six Sigma Master Black Belt, I have a deep passion for quality, specifically in software and data. I have a couple of decades of experience in software development and quality, and close to a decade in data engineering and data quality. Throughout my career in software development, I have faced difficult situations involving software quality. Applying structured problem-solving methods has often helped isolate root causes, many of which are within the control of immediate or adjacent teams, making it possible to implement meaningful improvements.
However, when it comes to data quality, the solutions are often not as straightforward as they are for software quality. I plan to share my experiences in small segments over the coming weeks and months, both to share what I have learned and to learn from my readers.
This week, I am going to touch on three common challenges I have encountered while delivering data platforms. There are many more, which I will cover in subsequent posts.
1. Heterogeneous Data Sources
Data engineering frequently involves integrating data from a variety of heterogeneous sources. This process requires complex transformations and mappings, which inherently carry a high risk of discrepancies and errors. Each data source may have different formats, standards, and quality levels, making it difficult to ensure consistency across all datasets. The list below summarizes the main pain points; a small normalization sketch follows it.
- Complex Transformations: Aligning disparate data requires intricate transformation logic to reconcile differences in schema, units of measure, and data types.
- Increased Error Risk: The more transformations data undergoes, the higher the probability of introducing errors or inconsistencies.
- Quality Variability: Data originating from external sources may not adhere to the same quality standards, further complicating integration efforts.
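To make this concrete, here is a minimal sketch in Python with pandas. The two sources, their column names, and the unit conversion are hypothetical; the point is that every source must be mapped explicitly into one canonical schema, and each mapping is a place where discrepancies can creep in.

```python
import pandas as pd

# Hypothetical source A: dates as ISO strings, weights in pounds
source_a = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-02-03"],
    "weight_lb": [12.5, 30.0],
})

# Hypothetical source B: dates as epoch seconds, weights already in kilograms
source_b = pd.DataFrame({
    "order_ts": [1705276800, 1706918400],
    "weight_kg": [5.67, 13.61],
})

def normalize_a(df: pd.DataFrame) -> pd.DataFrame:
    # Map source A into the canonical schema: datetime dates, metric weights
    return pd.DataFrame({
        "order_date": pd.to_datetime(df["order_date"]),
        "weight_kg": df["weight_lb"] * 0.45359237,  # pounds to kilograms
    })

def normalize_b(df: pd.DataFrame) -> pd.DataFrame:
    # Map source B: epoch seconds become datetimes, weights pass through
    return pd.DataFrame({
        "order_date": pd.to_datetime(df["order_ts"], unit="s"),
        "weight_kg": df["weight_kg"],
    })

# One canonical schema; every source is mapped into it explicitly,
# and each mapping is a potential source of error to review.
unified = pd.concat([normalize_a(source_a), normalize_b(source_b)], ignore_index=True)
print(unified)
```

Two sources already require two hand-written mappings; real platforms have dozens, which is where the error risk compounds.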
2. The Creator vs. User Conundrum
Data often undergoes several transformations throughout its lifecycle, and typically, the creators of the data have little to no visibility into how it will be used downstream. This lack of transparency increases the complexity of data transformations and makes it extremely challenging to maintain reliability in data pipelines.
- Lack of Communication: Without effective communication between data producers and consumers, assumptions are made that may not hold true across different use cases.
- Diverse Requirements: Downstream users might have specific needs that were not anticipated by the data creators, leading to gaps in data suitability.
- Alignment Challenges: Bridging the gap between data creation and usage requires robust data governance and collaboration mechanisms; one lightweight mechanism, a data contract, is sketched below.
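A data contract is one way to replace assumptions with explicit guarantees: the producer publishes the fields and rules it commits to, and consumers validate against them. Below is a minimal sketch in Python; the field names, types, and nullability rules are illustrative assumptions, not a prescription for any particular tool.

```python
# A minimal, hypothetical data contract: the producer publishes the fields
# it guarantees; consumers validate records instead of assuming the shape.
CONTRACT = {
    "customer_id":    {"type": str,   "nullable": False},
    "signup_date":    {"type": str,   "nullable": False},  # ISO-8601 expected
    "lifetime_value": {"type": float, "nullable": True},
}

def validate_record(record: dict) -> list:
    """Return a list of contract violations for one record."""
    violations = []
    for field, rule in CONTRACT.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif record[field] is None:
            if not rule["nullable"]:
                violations.append(f"null not allowed: {field}")
        elif not isinstance(record[field], rule["type"]):
            violations.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return violations

# Consumer-side check: the contract says signup_date is never null.
print(validate_record({"customer_id": "C-101", "signup_date": None, "lifetime_value": 42.0}))
# -> ['null not allowed: signup_date']
```

The value is less in the code than in the conversation it forces: producers must state what they guarantee, and consumers must state what they depend on.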
3. Testing Difficulties
In software engineering, it is generally possible to test the riskiest scenarios using well-designed test strategies and cases. Feature-driven testing is fairly reliable and consistent because software behavior can be predicted and validated against expected outcomes. In data engineering, however, the data itself is often inconsistent, for the reasons listed below; a sketch of load-time checks follows the list.
- Data Variability: Unlike static code, data changes over time, and new data can introduce unexpected scenarios that were not present during initial testing.
- Delayed Issue Discovery: Many critical data quality issues surface months or even years after deployment, when specific data or business scenarios are encountered.
- Testing Limitations: It is impractical to anticipate and test every possible data permutation, especially with large volumes and varied sources.
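Because no test suite can anticipate every permutation, one pragmatic mitigation is to run rule-based checks on every load rather than only at release time, so that drift and late-surfacing defects are caught as the data arrives. Here is a minimal sketch in Python with pandas; the table, columns, and thresholds are hypothetical.

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> dict:
    """Rule-based checks run on every load; rules and thresholds are illustrative."""
    return {
        "no_duplicate_ids": df["order_id"].is_unique,
        "no_future_dates": bool((df["order_date"] <= pd.Timestamp.now()).all()),
        "amount_in_range": bool(df["amount"].between(0, 1_000_000).all()),
        "null_rate_ok": df["amount"].isna().mean() <= 0.01,  # tolerate <= 1% nulls
    }

# A batch with the kinds of defects that often surface long after launch:
# a duplicate id, a missing amount, and a future-dated order.
batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "order_date": pd.to_datetime(["2024-01-01", "2024-05-01", "2030-01-01"]),
    "amount": [100.0, None, 250.0],
})
failed = [name for name, ok in check_orders(batch).items() if not ok]
print("failed checks:", failed)  # every rule flags this batch
```

Checks like these do not prove the data is right, but they turn silent, delayed failures into loud, immediate ones.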
Conclusion
Achieving high data quality is a complex and multifaceted challenge that extends beyond technical solutions. The challenges above point to three areas that data engineers and product managers delivering data projects should focus on.
1) Data Engineering should be a critical voice in reducing systems complexity: In large legacy organizations, systems were designed for specific workflows, often in silos, with no intent to drive insights. Organizations usually have plans to modernize these complex heterogeneous sources, but the prioritization of that refactoring effort often excludes input from data engineering teams. There is a hidden cost here, borne largely by data engineering; if data engineering leaders have a voice at the table and can recommend the right refactoring ideas to reduce systems complexity, the challenges of heterogeneous data sources can gradually be overcome.
2) Creator-User Close Collaboration: Foster communication between data creators and users to align expectations and requirements. This is easier said than done, as not all users will be known when a data platform launches. Taking a persona-driven approach, examining each data domain and the personas likely to consume it, is one step toward bridging that gap.
3) Test-Driven Data Engineering: Often, data engineers do not fully understand how the data will be used or which specific insights are being sought, so their testing focuses mostly on sample-based checks for completeness and timeliness. That alone is not enough; engineers should push themselves to identify the usage scenarios they need to cover. This is where a product mindset for data projects is very important (a sketch of a scenario-first test follows below).
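As a sketch of what test-driven can mean here: write the usage scenario down as an executable test before, or alongside, building the transformation. The monthly_revenue function, the refund convention, and the figures below are assumptions for illustration only.

```python
import pandas as pd

def monthly_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: net revenue per calendar month."""
    out = orders.copy()
    out["month"] = out["order_date"].dt.to_period("M")
    return out.groupby("month", as_index=False)["amount"].sum()

def test_refunds_reduce_revenue():
    # The usage scenario, written down first: finance records refunds as
    # negative amounts, so monthly revenue must net them out, not drop them.
    orders = pd.DataFrame({
        "order_date": pd.to_datetime(["2024-03-01", "2024-03-15"]),
        "amount": [100.0, -40.0],  # a sale followed by a partial refund
    })
    result = monthly_revenue(orders)
    assert result.loc[0, "amount"] == 60.0

test_refunds_reduce_revenue()
print("scenario test passed")
```

The test encodes a business expectation, not just a schema; that is the difference between sample-based testing and usage-driven testing.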
I will stop this week with these three challenges and suggestions. I look forward to feedback from my readers so that I can learn from your experiences. Please comment, and feel free to challenge my thoughts so that I can benefit from your insights and wisdom.