"From Source to Insight: Why Data Quality Remains Elusive" – Part 1
I have been thinking more deeply about data quality after a series of frequent, challenging encounters with poor data quality in practice. These issues often lead to user frustration and can overshadow the valuable work of data engineers.
As a Six Sigma Master Black Belt, I have a deep passion for quality, specifically in software and data. I have a couple of decades of experience in software development and quality, and close to a decade in data engineering and data quality. Throughout my career in software development, I have faced difficult situations involving software quality. Applying structured problem-solving methods has often helped isolate root causes, many of which are within the control of immediate or adjacent teams, making it possible to implement meaningful improvements.
However, when it comes to data quality, the solutions are often not as straightforward as they are for software quality. I plan to share my experiences in small segments over the coming weeks and months, both to share what I have learned and to learn from my readers.
This week, I am going to touch on three common challenges I have encountered while delivering data platforms. There are many more, which I will cover in subsequent posts.
1. Heterogeneous Data Sources
Data engineering frequently involves integrating data from a variety of heterogeneous sources. This process requires complex transformations and mappings, which inherently carry a high risk of discrepancies and errors. Each data source may have different formats, standards, and quality levels, making it difficult to ensure consistency across all datasets. The list below summarizes the main pain points; a small normalization sketch follows it.
- Complex Transformations: Aligning disparate data requires intricate transformation logic to reconcile differences in schema, units of measure, and data types.
- Increased Error Risk: The more transformations data undergoes, the higher the probability of introducing errors or inconsistencies.
- Quality Variability: Data originating from external sources may not adhere to the same quality standards, further complicating integration efforts.
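To make this concrete, here is a minimal sketch in Python with pandas. The two sources, their column names, and the unit conversion are hypothetical; the point is that every source must be mapped explicitly into one canonical schema, and each mapping is a place where discrepancies can creep in.

```python
import pandas as pd

# Hypothetical source A: dates as ISO strings, weights in pounds
source_a = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-02-03"],
    "weight_lb": [12.5, 30.0],
})

# Hypothetical source B: dates as epoch seconds, weights already in kilograms
source_b = pd.DataFrame({
    "order_ts": [1705276800, 1706918400],
    "weight_kg": [5.67, 13.61],
})

def normalize_a(df: pd.DataFrame) -> pd.DataFrame:
    # Map source A into the canonical schema: datetime dates, metric weights
    return pd.DataFrame({
        "order_date": pd.to_datetime(df["order_date"]),
        "weight_kg": df["weight_lb"] * 0.45359237,  # pounds to kilograms
    })

def normalize_b(df: pd.DataFrame) -> pd.DataFrame:
    # Map source B: epoch seconds become datetimes, weights pass through
    return pd.DataFrame({
        "order_date": pd.to_datetime(df["order_ts"], unit="s"),
        "weight_kg": df["weight_kg"],
    })

# One canonical schema; every source is mapped into it explicitly,
# and each mapping is a potential source of error to review.
unified = pd.concat([normalize_a(source_a), normalize_b(source_b)], ignore_index=True)
print(unified)
```

Two sources already require two hand-written mappings; real platforms have dozens, which is where the error risk compounds.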
2. The Creator vs. User Conundrum
Data often undergoes several transformations throughout its lifecycle, and typically, the creators of the data have little to no visibility into how it will be used downstream. This lack of transparency increases the complexity of data transformations and makes it extremely challenging to maintain reliability in data pipelines.
- Lack of Communication: Without effective communication between data producers and consumers, assumptions are made that may not hold true across different use cases.
- Diverse Requirements: Downstream users might have specific needs that were not anticipated by the data creators, leading to gaps in data suitability.
- Alignment Challenges: Bridging the gap between data creation and usage requires robust data governance and collaboration mechanisms; one lightweight mechanism, a data contract, is sketched below.
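A data contract is one way to replace assumptions with explicit guarantees: the producer publishes the fields and rules it commits to, and consumers validate against them. Below is a minimal sketch in Python; the field names, types, and nullability rules are illustrative assumptions, not a prescription for any particular tool.

```python
# A minimal, hypothetical data contract: the producer publishes the fields
# it guarantees; consumers validate records instead of assuming the shape.
CONTRACT = {
    "customer_id":    {"type": str,   "nullable": False},
    "signup_date":    {"type": str,   "nullable": False},  # ISO-8601 expected
    "lifetime_value": {"type": float, "nullable": True},
}

def validate_record(record: dict) -> list:
    """Return a list of contract violations for one record."""
    violations = []
    for field, rule in CONTRACT.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif record[field] is None:
            if not rule["nullable"]:
                violations.append(f"null not allowed: {field}")
        elif not isinstance(record[field], rule["type"]):
            violations.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return violations

# Consumer-side check: the contract says signup_date is never null.
print(validate_record({"customer_id": "C-101", "signup_date": None, "lifetime_value": 42.0}))
# -> ['null not allowed: signup_date']
```

The value is less in the code than in the conversation it forces: producers must state what they guarantee, and consumers must state what they depend on.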
3. Testing Difficulties
In software engineering, it is generally possible to test the riskiest scenarios using well-designed test strategies and cases. Feature-driven testing is fairly reliable and consistent because software behavior can be predicted and validated against expected outcomes. In data engineering, however, the data itself is often inconsistent, for the reasons listed below; a sketch of load-time checks follows the list.
- Data Variability: Unlike static code, data changes over time, and new data can introduce unexpected scenarios that were not present during initial testing.
- Delayed Issue Discovery: Many critical data quality issues surface months or even years after deployment, when specific data or business scenarios are encountered.
- Testing Limitations: It is impractical to anticipate and test every possible data permutation, especially with large volumes and varied sources.
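Because no test suite can anticipate every permutation, one pragmatic mitigation is to run rule-based checks on every load rather than only at release time, so that drift and late-surfacing defects are caught as the data arrives. Here is a minimal sketch in Python with pandas; the table, columns, and thresholds are hypothetical.

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> dict:
    """Rule-based checks run on every load; rules and thresholds are illustrative."""
    return {
        "no_duplicate_ids": df["order_id"].is_unique,
        "no_future_dates": bool((df["order_date"] <= pd.Timestamp.now()).all()),
        "amount_in_range": bool(df["amount"].between(0, 1_000_000).all()),
        "null_rate_ok": df["amount"].isna().mean() <= 0.01,  # tolerate <= 1% nulls
    }

# A batch with the kinds of defects that often surface long after launch:
# a duplicate id, a missing amount, and a future-dated order.
batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "order_date": pd.to_datetime(["2024-01-01", "2024-05-01", "2030-01-01"]),
    "amount": [100.0, None, 250.0],
})
failed = [name for name, ok in check_orders(batch).items() if not ok]
print("failed checks:", failed)  # every rule flags this batch
```

Checks like these do not prove the data is right, but they turn silent, delayed failures into loud, immediate ones.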
Conclusion
Achieving high data quality is a complex and multifaceted challenge that extends beyond technical solutions. The challenges above point to three areas that data engineers and product managers delivering data projects should focus on.
1) Data Engineering should be a critical voice in reducing systems complexity: In large legacy organizations, systems were designed for specific workflows, often in silos, with no intent to drive insights. Organizations usually have plans to modernize these complex heterogeneous sources, but the prioritization of that refactoring effort often excludes input from data engineering teams. There is a hidden cost here, borne largely by data engineering; if data engineering leaders have a voice at the table and can recommend the right refactoring ideas to reduce systems complexity, the challenges of heterogeneous data sources can gradually be overcome.
2) Creator-User Close Collaboration: Foster communication between data creators and users to align expectations and requirements. This is easier said than done, as not all users will be known when a data platform launches. Taking a persona-driven approach, examining each data domain and the personas likely to consume it, is one step toward bridging that gap.
3) Test-Driven Data Engineering: Often, data engineers do not fully understand how the data will be used or which specific insights are being sought, so their testing focuses mostly on sample-based checks for completeness and timeliness. That alone is not enough; engineers should push themselves to identify the usage scenarios they need to cover. This is where a product mindset for data projects is very important (a sketch of a scenario-first test follows below).
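As a sketch of what test-driven can mean here: write the usage scenario down as an executable test before, or alongside, building the transformation. The monthly_revenue function, the refund convention, and the figures below are assumptions for illustration only.

```python
import pandas as pd

def monthly_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: net revenue per calendar month."""
    out = orders.copy()
    out["month"] = out["order_date"].dt.to_period("M")
    return out.groupby("month", as_index=False)["amount"].sum()

def test_refunds_reduce_revenue():
    # The usage scenario, written down first: finance records refunds as
    # negative amounts, so monthly revenue must net them out, not drop them.
    orders = pd.DataFrame({
        "order_date": pd.to_datetime(["2024-03-01", "2024-03-15"]),
        "amount": [100.0, -40.0],  # a sale followed by a partial refund
    })
    result = monthly_revenue(orders)
    assert result.loc[0, "amount"] == 60.0

test_refunds_reduce_revenue()
print("scenario test passed")
```

The test encodes a business expectation, not just a schema; that is the difference between sample-based testing and usage-driven testing.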
I will stop this week with these three challenges and suggestions. I look forward to feedback from my readers so that I can learn from your experiences. Please comment, and feel free to challenge my thoughts so that I can benefit from your insights and wisdom.