DQ Engineering
DQ stands for Data Quality. If you don't have a background in data quality, read this first: https://www.alation.com/blog/what-is-data-quality-why-is-it-important/. Then read this so you understand DQ in the context of Data Engineering: https://www.projectpro.io/article/data-engineering-best-practices/943 (which is where the image above comes from).
In the last year I've been privileged to work with DQ Engineers (for the first time). And now in my current role (just 3 weeks in!) I'm encountering the DQ Engineer role again. So I thought it would be useful for others if I share my experience and views on DQ Engineering, which is a discipline in its own right.
The thing about DQ Engineering is: business users don't write SQL. And that is a big thing when you use a Data Management tool like Atlan, Alation, Ataccama (why do they all start with "A", I wonder), Collibra or IDMC. They are data catalogs, but they are also DQ tools. But to write DQ rules with them you need to write SQL. Yes, for "built-in" DQ rules like "must not be null", "must contain one of these values", "must be unique", "count of rows must be within 5% of yesterday's count" - for built-in basic DQ rules like that you get it out of the box, without writing any SQL.
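For the curious, here is roughly what those built-in rules boil down to underneath. A minimal sketch in Postgres-style SQL; the customer table, its columns, and the row_count_history snapshot table are my assumptions for illustration, not what any of these tools actually generates:

    -- "must not be null"
    SELECT COUNT(*) AS failing_rows
    FROM customer
    WHERE customer_id IS NULL;

    -- "must contain one of these values"
    SELECT COUNT(*) AS failing_rows
    FROM customer
    WHERE country_code NOT IN ('GB', 'US', 'SG');

    -- "must be unique"
    SELECT customer_id, COUNT(*) AS occurrences
    FROM customer
    GROUP BY customer_id
    HAVING COUNT(*) > 1;

    -- "row count within 5% of yesterday's count", assuming a daily snapshot
    -- table row_count_history(snapshot_date, table_name, row_count)
    SELECT today.row_count AS today_count, yesterday.row_count AS yesterday_count
    FROM row_count_history today
    JOIN row_count_history yesterday
      ON yesterday.table_name = today.table_name
     AND yesterday.snapshot_date = today.snapshot_date - INTERVAL '1 day'
    WHERE today.table_name = 'customer'
      AND today.snapshot_date = CURRENT_DATE
      AND ABS(today.row_count - yesterday.row_count) > 0.05 * yesterday.row_count;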
But as you know, no DQ project ever gets by with just the standard built-in rules. The business has a lot of "custom DQ rules", like: "if the product type is X, then the start date must be 3 weeks after signing, whereas for product type Y the start date can be whenever". These are called "business logic" and there are a lot of them: on customers, on products, on securities (as in financial instruments), on financial accounts, etc. In a project you can have hundreds of these "business logic" DQ rules. They help identify data issues, and they are very, very useful.
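As a sketch, that product rule could be written like this (the contract table and its column names are my assumptions; returning the violating rows is the usual convention for DQ checks):

    -- Business rule: for product type X, the start date must be at least
    -- 3 weeks (21 days) after signing. Return the rows that violate it.
    SELECT contract_id, product_type, signing_date, start_date
    FROM contract
    WHERE product_type = 'X'
      AND start_date < signing_date + INTERVAL '21 days';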
But there are 2 things which affect the way you work (the operating model):
- The business can't write SQL, so we need someone called a "DQ Engineer" to write the SQL.
- There is serious engineering in DQ Engineering. Not only do you need to write the SQL, you also have to test it, deploy it, run it, etc. During development you need CI/CD, a code repository, pull requests, etc. And in operation/production you need to monitor that it runs daily, store the output, and then process the output.
Say for today's data you run 100 rules and get 40 data issues. 5 of them are new issues which you have not encountered before; 35 of them have happened before. So you need the DQ team to fix those data issues in the source systems (for example, duplicate customers, a product without a product type, and so on).
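Telling new issues from recurring ones needs the run history. A minimal sketch, assuming each run appends its failures to a dq_issue table with columns (rule_id, record_key, run_date):

    -- Split today's failures into 'new' (never seen before) and 'recurring'
    -- (the same rule failed on the same record in an earlier run).
    SELECT t.rule_id,
           t.record_key,
           CASE WHEN EXISTS (SELECT 1
                             FROM dq_issue p
                             WHERE p.rule_id = t.rule_id
                               AND p.record_key = t.record_key
                               AND p.run_date < t.run_date)
                THEN 'recurring' ELSE 'new'
           END AS issue_status
    FROM dq_issue t
    WHERE t.run_date = CURRENT_DATE;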
So you have 3 types of people in the DQ team:
- Business user / business analyst: they specify the business rules.
- DQ Engineer: they write the SQL based on the spec from the BA/user in #1.
- DQ Remediation: they fix the data in the source system based on the output of #2.
I mentioned above that there is a serious amount of "engineering" in DQ. A lot of you are familiar with "data engineering", which basically builds data pipelines and data storage (and some transformations too, and some analytics too). DQ Engineering is as serious as that. You need a proper kit to do the job. You can use the Data Management tools (Atlan, Alation, Ataccama, Collibra, IDMC), but they are not enough. Why? They can't do CI/CD, PRs, deployment, execution, monitoring, logging, etc. So you need to build all that yourself, as a DQ Engineer. For this you need a proper ELT/ETL tool, like ADF, Talend, PowerCenter, Matillion, dbt, DLT, PySpark notebooks, Glue, Lambda, Alteryx, Hevo, SnapLogic, Ab Initio, etc. That's where you write the SQL that checks the data, implementing the business rules from the BA/user.
But then you need to run those tasks. In a way, it is part of your data pipeline. A data engineer is capable of becoming a DQ Engineer. In fact, DQ Engineering is part of a Data Engineer's job: not only do you need to build the data pipelines, you also need to build the DQ rules on those pipelines.
But if you look on LinkedIn Jobs, there are separate DQ Engineer roles (separate from Data Engineer). And this is because they are part of the DQ team. They receive the business rules spec from the BAs, and write the DQ rules in Atlan, Alation, Ataccama, Collibra, IDMC, etc. And not only that: a DQ Engineer is also good at setting up a data catalog. They can point the catalog at a database and the catalog engine will scan the "metadata" and the "data lineage". The metadata is the table names, column names, primary keys, foreign keys, null constraints, data types, data profiles, etc. The data lineage is which procs/models depend on each table, and vice versa.
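That metadata scan is not magic: a large part of it is reading the database's own system views. A sketch using the ANSI information_schema, which most databases and warehouses expose:

    -- List every column with its data type and nullability: the raw
    -- material for the catalog's metadata and data profiling.
    SELECT c.table_schema,
           c.table_name,
           c.column_name,
           c.data_type,
           c.is_nullable
    FROM information_schema.columns c
    ORDER BY c.table_schema, c.table_name, c.ordinal_position;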
So I think DQ Engineering is a discipline in its own right. In this discipline the tasks/skills are:
- Setting up DM tools: Atlan, Alation, Ataccama, Collibra, IDMC, etc.
- Scanning source databases to get metadata and data lineage.
- Writing DQ rules and scheduling them to run (timer or event based).
- Outputting the results from #3 into tables and processing them (see the sketch after this list).
- Reporting on #4 using visualisation tools like Power BI, Tableau, Qlik, Looker, QuickSight, Grafana, etc. Or lower level than that: Streamlit, Seaborn, Plotly, Shiny, Bokeh.
- Helping troubleshoot the data issues in #4 above.
- Resolving the data issues in #4 using transformations.
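For task 4, here is a minimal sketch of the output table and how a rule writes into it. The dq_result table, the rule id, and the contract rule carried over from the earlier example are all my assumptions:

    -- One row per rule per run, so results can be trended over time.
    CREATE TABLE IF NOT EXISTS dq_result (
      run_date     DATE,
      rule_id      VARCHAR(50),
      rule_name    VARCHAR(200),
      failing_rows INT,
      rows_checked INT
    );

    -- Each scheduled rule appends its summary, e.g. the product rule above:
    INSERT INTO dq_result (run_date, rule_id, rule_name, failing_rows, rows_checked)
    SELECT CURRENT_DATE,
           'DQ042',                                  -- hypothetical rule id
           'Product X: start date 3 weeks after signing',
           COALESCE(SUM(CASE WHEN product_type = 'X'
                              AND start_date < signing_date + INTERVAL '21 days'
                        THEN 1 ELSE 0 END), 0),
           COUNT(*)
    FROM contract;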
Of the above 7 tasks, point 5 is very important. That visualisation/report is read by many business users. It is the part of the whole DQ team's work that is visible to senior management.
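The dashboard in point 5 typically sits on a summary query over that output table. A sketch, reusing the assumed dq_result table from above:

    -- Pass rate per rule per day for the last 30 days: the classic
    -- DQ dashboard dataset for Power BI, Tableau, etc.
    SELECT rule_id,
           rule_name,
           run_date,
           failing_rows,
           rows_checked,
           1.0 - CAST(failing_rows AS FLOAT) / NULLIF(rows_checked, 0) AS pass_rate
    FROM dq_result
    WHERE run_date >= CURRENT_DATE - INTERVAL '30 days'
    ORDER BY rule_id, run_date;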
Point 7 is also very important, because not all data issues can be fixed manually in the source system. If it's only 10 records, then yes, by all means. But if it's 1,000 records it will take a long time. So the DQ Engineer needs to help fix the data programmatically, based on the "fixing rules" specified by the BA/business.
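A sketch of what such a programmatic fix can look like, assuming the BA's fixing rule is "products with no product type default to UNCLASSIFIED" (the table, column and default value are all my assumptions, and in practice this would go through the same CI/CD and approval as any other change):

    -- Apply the BA-specified fixing rule in bulk, instead of editing
    -- 1,000 records by hand in the source system's UI.
    UPDATE product
    SET product_type = 'UNCLASSIFIED'
    WHERE product_type IS NULL;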
One last thing: there are 2 places where you run DQ rules:
- On the data source.
- On the presentation layer, after the data is transformed.
For #1, on the data source, you check for things like: "a product must have a product type", "a client must have a last name, first name, DOB and address".
But for #2, after the data is transformed into, say, dim_customer and dim_security, you also need to check that the mandatory fields are populated correctly. For example, in dim_security the maturity value should be between 0 and 15 years, and in the ESG fact table the carbon emission must be within a certain range for companies in a certain industry.
Why do you need to check the presentation layer? Because those fields are calculated/derived. So all the ingredients could be fine in the source system, but after they are calculated the output could be invalid.
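So the presentation-layer rule is written against the derived column itself. A sketch for the dim_security example (the table and column names are my assumptions):

    -- maturity_years is calculated in the transformation, so it is checked
    -- after the transformation: return rows outside the valid range,
    -- plus rows where the derived value was not populated at all.
    SELECT security_key, maturity_years
    FROM dim_security
    WHERE maturity_years NOT BETWEEN 0 AND 15
       OR maturity_years IS NULL;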
Hope this is useful. Keep learning!
If you need a book on Data Quality Engineering, my suggestion is this: https://www.amazon.co.uk/Data-Quality-Engineering-Financial-Services-ebook/dp/B0BJTVVT3S
And if you need a book on Data Engineering…
Data enthusiast | Microsoft Certified Associate: (1) Fabric Analytics Engineer, (2) Azure Enterprise Data Analyst, (3) Power BI Data Analyst | Blogger | MSc - Quantitative Economics
1 day ago: Thanks Vincent for the great content, as always! Have you ever used, or seen companies use, Microsoft Purview as a Data Catalog/Data Quality tool? I didn't see you mention it. What's your thought on Purview? Thanks!
Data Engineer | Software Engineer | GCP 1x | Snowflake - DBT - Python - SQL - AWS | IOT enthusiast
1 day ago: Vincent Rainardi, I work with Snowflake as a data platform, Jenkins for CI/CD, and GitHub. We run certain DQ rules with dbt. Elementary helps us with some more specific DQ tests like schema changes, volume anomalies and data freshness, among others; it also creates some artifacts in the data platform with test results and metadata that can be used in dashboards. Databand for alerting, monitoring and observability, and Atlan as the catalog and lineage. That is the high level. I am going to take a look at the tools you mentioned too.
Data Engineer | Software Engineer | GCP 1x | Snowflake - DBT - Python - SQL - AWS | IOT enthusiast
2 days ago: Thanks for sharing, Vincent Rainardi. I have had the privilege of merging my data engineering role with some analytics and DQ engineering, and it's true that it is crucial to develop a strong DQ process for all data products in our companies, so that we leverage the most of their power. Here is some of the stack I have used: dbt, Elementary, Databand and Atlan.