DQ Engineering
DQ stands for Data Quality. If you don't have a background in data quality, read this first: https://www.alation.com/blog/what-is-data-quality-why-is-it-important/. Then read this so you understand DQ in the context of Data Engineering: https://www.projectpro.io/article/data-engineering-best-practices/943 (which is where the image above comes from).
In the last year I've been privileged to work with DQ Engineers (for the first time). And now in my current role (just 3 weeks in!) I'm encountering the DQ Engineer role again. So I thought it would be useful for others if I share my experience and views on DQ Engineering, which is a discipline in its own right.
The thing about DQ Engineering is: business users don't write SQL. And that is a big thing when you use a Data Management tool like Atlan, Alation, Ataccama (why do they all start with "A", I wonder), Collibra or IDMC. They are data catalogs, but they are also DQ tools. But to write DQ rules with them you need to write SQL. Yes, for "built-in" DQ rules like "must not be null", "must contain one of these values", "must be unique", "count of rows must be within 5% of yesterday's count" - for built-in basic DQ rules like that you get it out of the box, without writing any SQL.
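For the curious, here is roughly what those built-in rules boil down to underneath. A minimal sketch in Postgres-style SQL; the customer table, its columns, and the row_count_history snapshot table are my assumptions for illustration, not what any of these tools actually generates:

    -- "must not be null"
    SELECT COUNT(*) AS failing_rows
    FROM customer
    WHERE customer_id IS NULL;

    -- "must contain one of these values"
    SELECT COUNT(*) AS failing_rows
    FROM customer
    WHERE country_code NOT IN ('GB', 'US', 'SG');

    -- "must be unique"
    SELECT customer_id, COUNT(*) AS occurrences
    FROM customer
    GROUP BY customer_id
    HAVING COUNT(*) > 1;

    -- "row count within 5% of yesterday's count", assuming a daily snapshot
    -- table row_count_history(snapshot_date, table_name, row_count)
    SELECT today.row_count AS today_count, yesterday.row_count AS yesterday_count
    FROM row_count_history today
    JOIN row_count_history yesterday
      ON yesterday.table_name = today.table_name
     AND yesterday.snapshot_date = today.snapshot_date - INTERVAL '1 day'
    WHERE today.table_name = 'customer'
      AND today.snapshot_date = CURRENT_DATE
      AND ABS(today.row_count - yesterday.row_count) > 0.05 * yesterday.row_count;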
But as you know, no DQ project ever gets by with just the standard built-in rules. The business has a lot of "custom DQ rules", like: "if the product type is X, then the start date must be 3 weeks after signing, whereas for product type Y the start date can be whenever". These are called "business logic" and there are a lot of them: on customers, on products, on securities (as in financial instruments), on financial accounts, etc. In a project you can have hundreds of these "business logic" DQ rules. They help identify data issues, and they are very, very useful.
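As a sketch, that product rule could be written like this (the contract table and its column names are my assumptions; returning the violating rows is the usual convention for DQ checks):

    -- Business rule: for product type X, the start date must be at least
    -- 3 weeks (21 days) after signing. Return the rows that violate it.
    SELECT contract_id, product_type, signing_date, start_date
    FROM contract
    WHERE product_type = 'X'
      AND start_date < signing_date + INTERVAL '21 days';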
But there are 2 things which affect the way you work (the operating model):
- The business can't write SQL, so we need someone called a "DQ Engineer" to write the SQL.
- There is serious engineering in DQ Engineering. Not only do you need to write the SQL, you also have to test it, deploy it, run it, etc. During development you need CI/CD, a code repository, pull requests, etc. And in operation/production you need to monitor that it runs daily, store the output, and then process the output.
Say for today's data you run 100 rules and get 40 data issues. 5 of them are new issues which you have not encountered before; 35 of them have happened before. So you need the DQ team to fix those data issues in the source systems (for example, duplicate customers, a product without a product type, and so on).
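Telling new issues from recurring ones needs the run history. A minimal sketch, assuming each run appends its failures to a dq_issue table with columns (rule_id, record_key, run_date):

    -- Split today's failures into 'new' (never seen before) and 'recurring'
    -- (the same rule failed on the same record in an earlier run).
    SELECT t.rule_id,
           t.record_key,
           CASE WHEN EXISTS (SELECT 1
                             FROM dq_issue p
                             WHERE p.rule_id = t.rule_id
                               AND p.record_key = t.record_key
                               AND p.run_date < t.run_date)
                THEN 'recurring' ELSE 'new'
           END AS issue_status
    FROM dq_issue t
    WHERE t.run_date = CURRENT_DATE;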
So you have 3 types of people in the DQ team:
- Business user / business analyst: they specify the business rules.
- DQ Engineer: they write the SQL based on the spec from the BA/user in #1.
- DQ Remediation: they fix the data in the source system based on the output of #2.
I mentioned above that there is a serious amount of "engineering" in DQ. A lot of you are familiar with "data engineering", which basically builds data pipelines and data storage (and some transformations too, and some analytics too). DQ Engineering is as serious as that. You need a proper kit to do the job. You can use the Data Management tools (Atlan, Alation, Ataccama, Collibra, IDMC), but they are not enough. Why? They can't do CI/CD, PRs, deployment, execution, monitoring, logging, etc. So you need to build all that yourself, as a DQ Engineer. For this you need a proper ELT/ETL tool, like ADF, Talend, PowerCenter, Matillion, dbt, DLT, PySpark notebooks, Glue, Lambda, Alteryx, Hevo, SnapLogic, Ab Initio, etc. That's where you write the SQL that checks the data, implementing the business rules from the BA/user.
But then you need to run those tasks. In a way, it is part of your data pipeline. A data engineer is capable of becoming a DQ Engineer. In fact, DQ Engineering is part of a Data Engineer's job: not only do you need to build the data pipelines, you also need to build the DQ rules on those pipelines.
But if you look on LinkedIn Jobs, there are separate DQ Engineer roles (separate from Data Engineer). And this is because they are part of the DQ team. They receive the business rules spec from the BAs, and write the DQ rules in Atlan, Alation, Ataccama, Collibra, IDMC, etc. And not only that: a DQ Engineer is also good at setting up a data catalog. They can point the catalog at a database and the catalog engine will scan the "metadata" and the "data lineage". The metadata is the table names, column names, primary keys, foreign keys, null constraints, data types, data profiles, etc. The data lineage is which procs/models depend on each table, and vice versa.
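That metadata scan is not magic: a large part of it is reading the database's own system views. A sketch using the ANSI information_schema, which most databases and warehouses expose:

    -- List every column with its data type and nullability: the raw
    -- material for the catalog's metadata and data profiling.
    SELECT c.table_schema,
           c.table_name,
           c.column_name,
           c.data_type,
           c.is_nullable
    FROM information_schema.columns c
    ORDER BY c.table_schema, c.table_name, c.ordinal_position;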
So I think DQ Engineering is a discipline in its own right. In this discipline the tasks/skills are:
- Setting up DM tools: Atlan, Alation, Ataccama, Collibra, IDMC, etc.
- Scanning source databases to get metadata and data lineage.
- Writing DQ rules and scheduling them to run (timer or event based).
- Outputting the results from #3 into tables and processing them (see the sketch after this list).
- Reporting on #4 using visualisation tools like Power BI, Tableau, Qlik, Looker, QuickSight, Grafana, etc. Or lower level than that: Streamlit, Seaborn, Plotly, Shiny, Bokeh.
- Helping troubleshoot the data issues in #4 above.
- Resolving the data issues in #4 using transformations.
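For task 4, here is a minimal sketch of the output table and how a rule writes into it. The dq_result table, the rule id, and the contract rule carried over from the earlier example are all my assumptions:

    -- One row per rule per run, so results can be trended over time.
    CREATE TABLE IF NOT EXISTS dq_result (
      run_date     DATE,
      rule_id      VARCHAR(50),
      rule_name    VARCHAR(200),
      failing_rows INT,
      rows_checked INT
    );

    -- Each scheduled rule appends its summary, e.g. the product rule above:
    INSERT INTO dq_result (run_date, rule_id, rule_name, failing_rows, rows_checked)
    SELECT CURRENT_DATE,
           'DQ042',                                  -- hypothetical rule id
           'Product X: start date 3 weeks after signing',
           COALESCE(SUM(CASE WHEN product_type = 'X'
                              AND start_date < signing_date + INTERVAL '21 days'
                        THEN 1 ELSE 0 END), 0),
           COUNT(*)
    FROM contract;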
Of the above 7 tasks, point 5 is very important. That visualisation/report is read by many business users. It is the part of the whole DQ team's work that is visible to senior management.
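The dashboard in point 5 typically sits on a summary query over that output table. A sketch, reusing the assumed dq_result table from above:

    -- Pass rate per rule per day for the last 30 days: the classic
    -- DQ dashboard dataset for Power BI, Tableau, etc.
    SELECT rule_id,
           rule_name,
           run_date,
           failing_rows,
           rows_checked,
           1.0 - CAST(failing_rows AS FLOAT) / NULLIF(rows_checked, 0) AS pass_rate
    FROM dq_result
    WHERE run_date >= CURRENT_DATE - INTERVAL '30 days'
    ORDER BY rule_id, run_date;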
Point 7 is also very important, because not all data issues can be fixed manually in the source system. If it's only 10 records, then yes, by all means. But if it's 1,000 records it will take a long time. So the DQ Engineer needs to help fix the data programmatically, based on the "fixing rules" specified by the BA/business.
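A sketch of what such a programmatic fix can look like, assuming the BA's fixing rule is "products with no product type default to UNCLASSIFIED" (the table, column and default value are all my assumptions, and in practice this would go through the same CI/CD and approval as any other change):

    -- Apply the BA-specified fixing rule in bulk, instead of editing
    -- 1,000 records by hand in the source system's UI.
    UPDATE product
    SET product_type = 'UNCLASSIFIED'
    WHERE product_type IS NULL;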
One last thing: there are 2 places where you run DQ rules:
- On the data source.
- On the presentation layer, after the data is transformed.
For #1, on the data source, you check for things like: "a product must have a product type", "a client must have a last name, first name, DOB and address".
But for #2, after the data is transformed into, say, dim_customer and dim_security, you also need to check that the mandatory fields are populated correctly. For example, in dim_security the maturity value should be between 0 and 15 years, and in the ESG fact table the carbon emission must be within a certain range for companies in a certain industry.
Why do you need to check the presentation layer? Because those fields are calculated/derived. So all the ingredients could be fine in the source system, but after they are calculated the output could be invalid.
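So the presentation-layer rule is written against the derived column itself. A sketch for the dim_security example (the table and column names are my assumptions):

    -- maturity_years is calculated in the transformation, so it is checked
    -- after the transformation: return rows outside the valid range,
    -- plus rows where the derived value was not populated at all.
    SELECT security_key, maturity_years
    FROM dim_security
    WHERE maturity_years NOT BETWEEN 0 AND 15
       OR maturity_years IS NULL;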
Hope this is useful. Keep learning!
If you need a book on Data Quality Engineering, my suggestion is this: https://www.amazon.co.uk/Data-Quality-Engineering-Financial-Services-ebook/dp/B0BJTVVT3S
And if you need a book on Data Engineering…
Data enthusiast | Microsoft Certified Associate: (1) Fabric Analytics Engineer, (2) Azure Enterprise Data Analyst, (3) Power BI Data Analyst | Blogger | MSc - Quantitative Economics
1 day ago: Thanks Vincent for the great content, as always! Have you ever used, or seen companies use, Microsoft Purview as a Data Catalog/Data Quality tool? I didn't see you mention it. What's your thought on Purview? Thanks!
Data Engineer | Software Engineer | GCP 1x | Snowflake - DBT - Python - SQL - AWS | IOT enthusiast
1 day ago: Vincent Rainardi, I work with Snowflake as a data platform, Jenkins for CI/CD, and GitHub. We run certain DQ rules with dbt. Elementary helps us with some more specific DQ tests like schema changes, volume anomalies and data freshness, among others; it also creates some artifacts in the data platform with test results and metadata that can be used in dashboards. Databand for alerting, monitoring and observability, and Atlan as the catalog and lineage. That is the high level. I am going to take a look at the tools you mentioned too.
Data Engineer | Software Engineer | GCP 1x | Snowflake - DBT - Python - SQL - AWS | IOT enthusiast
2 days ago: Thanks for sharing, Vincent Rainardi. I have had the privilege of merging my data engineering role with some analytics and DQ engineering, and it's true that it is crucial to develop a strong DQ process for all data products in our companies, so that we leverage the most of their power. Here is some of the stack I have used: dbt, Elementary, Databand and Atlan.