The Gender Problem: Why Data Quality Is Important

I have been playing with data since 1999, when I was first introduced to dBase, a cool CLI-based database management system. I have been in love with databases ever since. Everything was so neat and organized. This was the '90s and, unlike today, I didn't have a PC at home, so I used a local computer institute as my playground. I wrote countless lines of code, built forms, and created data management systems at the dot (.) prompt [the dBase interface is a command-line dot (.) prompt].

It was all ethereal, and then I saw the problem.

All that neatness and organization came crashing down when a data entry team entered bad data into a system I had designed. I had created a program for a local merchant to consolidate his sales within the state and recommend restocking orders. One fine day, the data entry person mixed up the order and typed names into the state field. It broke the entire form and the final report. It took me days to fix the whole thing (as you may have guessed, it's not easy to work with a command-line system). To make sure incidents like this wouldn't corrupt the data again, I built a staging system with lots of data quality checks, so that bad data never reached the target database.
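A staging check of that kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the original dBase code: the `SalesRow` fields, the `VALID_STATES` reference values, and the two rules are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Placeholder reference values (an assumption); a real system would load
# the full list of valid states from a reference table.
VALID_STATES = {"CA", "NY", "TX"}


@dataclass
class SalesRow:
    customer: str
    state: str
    amount: float


def validate_row(row):
    """Return a list of rule violations; an empty list means the row is clean."""
    errors = []
    if row.state.strip().upper() not in VALID_STATES:
        errors.append(f"unknown state: {row.state!r}")
    if row.amount < 0:
        errors.append(f"negative amount: {row.amount}")
    return errors


def stage(rows):
    """Split incoming rows into accepted rows and rejected (row, errors) pairs."""
    accepted, rejected = [], []
    for row in rows:
        errors = validate_row(row)
        if errors:
            rejected.append((row, errors))
        else:
            accepted.append(row)
    return accepted, rejected


rows = [
    SalesRow("Asha", "NY", 120.0),
    SalesRow("Ravi", "Ravi Kumar", 75.5),  # a name typed into the state field
]
accepted, rejected = stage(rows)
```

Only the accepted rows would move on to the target database; the rejected rows stay in staging with their reasons attached, so someone can correct them at the source.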

And on that day, I vowed to make it my life's mission never to compromise on data quality again. I have been on that mission ever since, helping fix data quality wherever I can, one system at a time.

Now, the real story. Fast forward to 2015: I was working as a data quality specialist for a renowned bank in Abu Dhabi (UAE).

The bank was trying to bring data from its various heterogeneous legacy systems together in one place. This consolidated database was the entry point for another system designed for master data management [a story for another time].

The reason I landed in Abu Dhabi was also very interesting.

The Data Engineering team was working under tight deadlines and delivery pressure. To finish within the timelines, the delivery squad decided to drop a few activities. One of them was data quality checks for less-critical data columns.

Everything was going as per plan until, one fine day, the COO office requested a comprehensive report on gender diversity.

All hell broke loose when they received this report the next day:

[Image not available: the gender diversity report as received]

Instead of this:

[Image not available: the gender diversity report as expected]

It turned out that the gender column had been classified as “less-critical” for data quality checks. Everyone took this column for granted and loaded it directly into the target system (what could possibly go wrong with gender, right?), but they forgot that they were pulling data from legacy source systems that didn’t share uniform data standards. So the gender column started receiving data that looked something like this:

[Image not available: sample values from the gender column]

Hence the chaos in the report, which resulted in endless escalations and war-room sessions.
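To see why a report falls apart this way, here is a small Python sketch that profiles a column without cleaning it. The sample values in `raw_gender` are hypothetical (the bank's actual data is not reproduced here), and `profile_column` is just a name chosen for the example.

```python
from collections import Counter


def profile_column(values):
    """Count how often each distinct raw value occurs, without any cleaning."""
    return Counter(values)


# Hypothetical raw values of the kind several legacy systems might send.
raw_gender = ["M", "Male", "male", "F", "Female", "f", "1", "", "M ", "FEMALE"]

profile = profile_column(raw_gender)
for value, count in profile.most_common():
    print(f"{value!r}: {count}")
print("distinct categories:", len(profile))
```

Ten records produce ten separate categories, because differences in case, spelling, and trailing spaces each count as a new value. A report grouped on such a column shows one slice per spelling rather than one per gender.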

I was called in as a data quality expert to test the whole system as a black box, focusing mainly on data quality from both technical and logical aspects.

Over a few months, my team and I wrote numerous data quality checks and transformation rules. We worked with business and legacy system owners to understand the originating data and used this knowledge to create a robust data quality framework.
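A transformation rule for a column like gender might look roughly like the sketch below. It is an illustration under assumptions, not the framework we built: `GENDER_MAP`, the canonical labels, and the "Unknown" fallback are invented for the example, and a real mapping has to come from the source system owners.

```python
# Assumed mapping from cleaned-up source codes to canonical labels.
# Only add a code here once the source system owner confirms its meaning.
GENDER_MAP = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}


def normalize_gender(value):
    """Return the canonical label, or None when the value has no agreed mapping."""
    if value is None:
        return None
    return GENDER_MAP.get(value.strip().lower())


def apply_rule(values):
    """Normalize a column and collect the raw values that need manual review."""
    clean, unmapped = [], []
    for value in values:
        label = normalize_gender(value)
        if label is None:
            unmapped.append(value)
            clean.append("Unknown")
        else:
            clean.append(label)
    return clean, unmapped


clean, unmapped = apply_rule(["M", " male", "FEMALE", "f", "1", ""])
```

Values without an agreed mapping, such as a numeric code, are not guessed. They are loaded as "Unknown" and listed separately, so they can be taken back to the business and the legacy system owners.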

The downside of this effort was a delay in the overall delivery of the solution; the upside was a stable, high-standard solution. This paved the way for the smooth execution of the master data management (MDM) project, which was completed in record time.

The end result is so important that any method, even an extensive and time-consuming one, is worth using to achieve it. I strongly believe that the end justifies the means and that we should do whatever it takes to get it done.

What we learned:

  • Data quality is the most important aspect of any successful data-based system.
  • There are no shortcuts to perfect data quality.
  • Real-time systems are chaotic and unpredictable.
  • Data quality issues are mostly introduced, or left unattended, by people.


Thanks Adarsh - many of us have experienced the pain you describe. Whether it's data in mixed units (temperatures, concentrations, masses, etc.), inclusion of transformed data (e.g. log, -log), or use of different standards (e.g. date, sex, species), it all needs to be cleaned up, usually when all the people involved have moved on to other projects/companies.

Shivam Rawat

Senior Manager- Strategic Sourcing-Global Lead

3 years ago

Food for thought

Sourabh Chourasia

A passionate Cyclist and a Sr. QA Consultant - Data and Analytics

3 years ago

Very well written. In one of my projects, we were getting data from multiple Australian states (each state had different file formats and different rules for gender, state, and various other coded fields). We started by analysing and sampling the data from the existing systems and identified the gap pretty early. That helped developers write various transformation rules to handle these situations.

Arjun Patel

Data Engineering Manager

3 years ago

Very well said, Adarsh. Whether it’s data integration or data migration, quality is of paramount importance. We tend to overlook columns and fields in the initial stages of a project, but in the later phases their business importance changes, and that’s where projects land in trouble. At the end of the day, when it comes to driving a business decision, the quality of the data matters the most.
