DQ Engineering

DQ stands for Data Quality. If you don't have a background in data quality, read this first: https://www.alation.com/blog/what-is-data-quality-why-is-it-important/. And then read this so you understand DQ in the context of Data Engineering: https://www.projectpro.io/article/data-engineering-best-practices/943.

In the last year I've been privileged to work with DQ Engineers for the first time. And now in my current role (just 3 weeks in!) I'm encountering the DQ Engineer role again. So I thought it would be useful to share my experience and views on DQ Engineering, which is a discipline in its own right.

The thing about DQ Engineering is: business users don't write SQL. And that is a big thing when you use a Data Management tool like Atlan, Alation, Ataccama (why they all start with "A", I wonder), Collibra or IDMC. They are data catalogs, but they are also DQ tools. To write custom DQ rules with them, though, you need to write SQL. Yes, "built-in" DQ rules like "must not be null", "must contain one of these values", "must be unique", "count of rows must be within 5% of yesterday's count" - basic rules like that you get out of the box, without writing any SQL.
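
For illustration, this is roughly the SQL those built-in rules compile to behind the scenes. A minimal sketch in generic SQL - the table and column names are made up, date arithmetic syntax varies by dialect, and the convention is that a rule query returns the rows that violate it:

    -- "must not be null": return the offending rows
    SELECT customer_id
    FROM   customer
    WHERE  last_name IS NULL;

    -- "must be unique": return duplicated keys
    SELECT   customer_id, COUNT(*) AS cnt
    FROM     customer
    GROUP BY customer_id
    HAVING   COUNT(*) > 1;

    -- "row count within 5% of yesterday's count": assumes a
    -- row_count_history table that logs one count per day
    SELECT t.cnt AS today_cnt, y.cnt AS yesterday_cnt
    FROM   row_count_history t
    JOIN   row_count_history y
           ON y.snapshot_date = t.snapshot_date - INTERVAL '1' DAY
    WHERE  t.snapshot_date = CURRENT_DATE
    AND    ABS(t.cnt - y.cnt) > 0.05 * y.cnt;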

But as you know, no DQ project ever gets by on the standard built-in rules alone. The business has a lot of "custom DQ rules", like: "if the product type is X, then the start date must be 3 weeks after signing, whereas for product type Y the start date can be whenever". These are called "business logic" and there are a lot of them: on customers, on products, on securities (as in financial instruments), on financial accounts, etc. In a project you can have hundreds of these "business logic DQ rules". They help identify data issues, and they are very, very useful.
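
As a sketch, that product rule might be implemented like this (table and column names are hypothetical, and I'm reading "must be 3 weeks after signing" as "at least 3 weeks after"):

    -- Business rule: for product type X, the start date must be at
    -- least 3 weeks after the signing date. Returns violations.
    -- Product type Y is deliberately not checked: its start date
    -- can be whenever.
    SELECT contract_id, product_type, signing_date, start_date
    FROM   contract
    WHERE  product_type = 'X'
    AND    start_date < signing_date + INTERVAL '21' DAY;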

But there are two things which affect the way you work (the operating model):

  1. The business can't write SQL, so we need someone called a "DQ Engineer" to write the SQL.
  2. There is serious engineering in DQ Engineering. Not only do you need to write the SQL, you also have to test it, deploy it, run it, etc. During development you need CI/CD, a code repository, pull requests, etc. And during operation/production you need to monitor that it runs daily, store the output, and then process the output.

Say for today's data you run 100 rules and get 40 data issues. 5 of them are new issues which you have not encountered before; 35 of them have happened before. So you need the DQ team to fix those data issues in the source systems (for example, duplicate customers, a product without a product type, and so on).
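
Separating the new issues from the recurring ones is itself an engineering task. A minimal sketch, assuming hypothetical dq_result (today's failures) and dq_result_history tables keyed by rule and offending record:

    -- New issues: failed today but never seen before
    SELECT r.rule_id, r.record_key
    FROM   dq_result r
    WHERE  r.run_date = CURRENT_DATE
      AND NOT EXISTS (
          SELECT 1
          FROM   dq_result_history h
          WHERE  h.rule_id    = r.rule_id
          AND    h.record_key = r.record_key
      );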

So you have 3 types of people in the DQ team:

  1. Business user / business analyst: they specify the business rules.
  2. DQ Engineer: they write the SQL based on the spec from the BA/user in #1.
  3. DQ Remediation: they fix the data in the source system based on the output of #2.

I mentioned above that there is a serious amount of "engineering" in DQ. A lot of you are familiar with "data engineering", which basically builds data pipelines and data storage (and some transformations, and some analytics too). DQ Engineering is as serious as that. You need a proper kit to do the job. You can use the Data Management tools (Atlan, Alation, Ataccama, Collibra, IDMC), but they are not enough. Why? They can't do CI/CD, PRs, deployment, execution, monitoring, logging, etc. So you need to build all that yourself, as a DQ Engineer. For this you need a proper ELT/ETL tool, like ADF, Talend, PowerCenter, Matillion, dbt, DLT, PySpark notebooks, Glue, Lambda, Alteryx, Hevo, SnapLogic, Ab Initio, etc. That's where you write the SQL to check the data, implementing the business rules from the BA/user.
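
To make that concrete with one of the tools on that list: in dbt, for example, a custom DQ rule can be written as a singular test - a SELECT that returns the failing rows (zero rows returned means the test passes) - and dbt plus your CI/CD pipeline handle compiling, running and reporting it. A sketch with hypothetical model and column names:

    -- tests/product_must_have_product_type.sql
    -- dbt singular test: any rows returned are counted as failures
    SELECT product_id
    FROM   {{ ref('stg_product') }}
    WHERE  product_type IS NULL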

But then you need to run those tasks; in a way, they are part of your data pipeline. A data engineer is capable of becoming a DQ Engineer. In fact, DQ Engineering is part of a Data Engineer's job: not only do you need to build the data pipelines, you also need to build the DQ rules on those pipelines.

But if you look on LinkedIn Jobs, there are separate DQ Engineer roles (separate from Data Engineer). And this is because they are part of the DQ team. They receive the business rules spec from BAs, and write the DQ rules in Atlan, Alation, Ataccama, Collibra, IDMC, etc. And not only that: a DQ Engineer is also good at setting up a data catalog. They can point the catalog to a database and the catalog engine will scan the "metadata" and the "data lineage". The metadata is the table names, column names, primary keys, foreign keys, null constraints, data types, data profiles, etc. Data lineage is which procs/models depend on each table, and vice versa.
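
Much of that metadata scan can be reproduced by hand against the database's system views, which is a useful way to understand what the catalog engine is doing. A rough sketch using the ANSI information_schema (the schema name is hypothetical; lineage usually requires parsing view and proc definitions on top of this, which is where the catalog engines earn their keep):

    -- Columns: names, data types, null constraints
    SELECT table_name, column_name, data_type, is_nullable
    FROM   information_schema.columns
    WHERE  table_schema = 'sales';

    -- Primary and foreign keys
    SELECT table_name, constraint_name, constraint_type
    FROM   information_schema.table_constraints
    WHERE  table_schema = 'sales'
    AND    constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY');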

So I think DQ Engineering is a discipline in its own right. In this discipline the tasks/skills are:

  1. Setting up DM tools: Atlan, Alation, Ataccama, Collibra, IDMC, etc.
  2. Scanning source databases to get metadata and data lineage.
  3. Writing DQ rules and scheduling them to run (timer or event based).
  4. Outputting the results from #3 into tables and processing them.
  5. Reporting #4 using visualisation tools like Power BI, Tableau, Qlik, Looker, QuickSight, Grafana, etc. Or lower level than that: Streamlit, Seaborn, Plotly, Shiny, Bokeh. (A sketch of the kind of query behind such a report follows this list.)
  6. Helping troubleshoot the data issues in #4 above.
  7. Resolving the data issues in #4 using transformations.
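
For points 4 and 5, the report typically sits on top of a summary of the stored rule outputs. A minimal sketch of the kind of query that feeds such a dashboard, reusing the hypothetical dq_result table from earlier (one row per failing record per rule per run):

    -- Daily failure counts per rule, feeding the DQ dashboard
    SELECT run_date, rule_id, COUNT(*) AS failed_records
    FROM   dq_result
    GROUP BY run_date, rule_id
    ORDER BY run_date, rule_id;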

Of the above 7 tasks, point 5 is very important. That visualisation/report is read by many business users; it is the part of the whole DQ team's work that is visible to senior management.

Point 7 is also very important, because not all data issues can be fixed manually in the source system. If it's only 10 records, then yes, by all means. If it's 1,000 records, it will take a long time. So the DQ Engineer needs to help fix the data programmatically, based on the "fixing rules" specified by the BA/business.
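
As a sketch of what a programmatic fix looks like, assuming a hypothetical fixing rule from the BA along the lines of "where product_type is missing and the product code starts with FX, set product_type to X":

    -- Dry run first: how many rows will this fixing rule touch?
    SELECT COUNT(*)
    FROM   product
    WHERE  product_type IS NULL
    AND    product_code LIKE 'FX%';

    -- The fix itself, applied to the source system
    UPDATE product
    SET    product_type = 'X'
    WHERE  product_type IS NULL
    AND    product_code LIKE 'FX%';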

One last thing: there are two places where you run DQ rules:

  1. On the data source.
  2. On the presentation layer, after the data is transformed.

For #1, on the data source, you check for things like: "a product must have a product type", "a client must have a last name, first name, DOB and address".

But for #2, after the data is transformed into, say, dim_customer and dim_security, you also need to check that the mandatory fields are populated correctly. For example, in dim_security the maturity value should be between 0 and 15 years, and in the ESG fact table the carbon emissions must be within a certain range for companies in a certain industry.
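
A sketch of such a presentation-layer check on the hypothetical dim_security, using the same return-the-violations convention as before:

    -- Derived field check: maturity must be between 0 and 15 years
    SELECT security_id, maturity_years
    FROM   dim_security
    WHERE  maturity_years IS NULL
       OR  maturity_years NOT BETWEEN 0 AND 15;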

Why do you need to check the presentation layer? Because those fields are calculated/derived. So all the ingredients could be fine in the source system, but after they are calculated the output could be invalid.

Hope this is useful. Keep learning!

If you need a book on Data Quality Engineering, my suggestion is this: https://www.amazon.co.uk/Data-Quality-Engineering-Financial-Services-ebook/dp/B0BJTVVT3S

And if you need a book on Data Engineering…

Dat TRIEU

Data enthusiast | Microsoft Certified Associate: (1) Fabric Analytics Engineer, (2) Azure Enterprise Data Analyst, (3) Power BI Data Analyst | Blogger | MSc - Quantitative Economics

1 day ago

Thanks Vincent for the great content, as always! Have you ever used and seen companies use Microsoft Purview as a Data Catalog/Data Quality tool? I didn't see you mention it. What's your thought about Purview? Thanks!

Christian Nazareno

Data Engineer | Software Engineer | GCP 1x | Snowflake - DBT - Python - SQL - AWS | IOT enthusiast

1 day ago

Vincent Rainardi, I work with Snowflake as a data platform, Jenkins for CI/CD, and GitHub. We run certain DQ rules with dbt. Elementary helps us with some more specific DQ tests like schema changes, volume anomalies, data freshness, among others; it also helps create some artifacts in the data platform with test results and metadata that can be used in dashboards. Databand for alerting, monitoring and observability, and Atlan as the catalog and lineage - that is the high level. I am going to take a look at the tools you mentioned too.

Christian Nazareno

Data Engineer | Software Engineer | GCP 1x | Snowflake - DBT - Python - SQL - AWS | IOT enthusiast

2 days ago

Thanks for sharing, Vincent Rainardi. I have had the privileged opportunity to merge my data engineering role with some analytics and DQ engineering, and it's true that it is crucial to develop a strong DQ process for all data products in our companies, so that we leverage the most of their power. Here is some of the stack I have used: dbt, Elementary, Databand and Atlan.
