2024 #6 - Unit Testing


In this edition of Analytics Engineering Today, we’re going to talk about unit testing: what it actually is, why it matters, and the misconception that running a few queries on your data is the same thing.

This distinction matters because, on some of the teams I’ve worked with, testing falls into the bucket of “let’s just check a few row counts and call it good.” That’s useful, but it’s not unit testing. Proper unit testing makes your transformations reliable, future-proof, and much easier to maintain.


What is Unit Testing?

Unit testing comes from software engineering. In that world, a “unit” is the smallest testable part of an application, usually a function or method. The purpose of unit tests is to check that these small components work exactly as intended, in isolation.

Key characteristics of unit tests:

  • Isolation: Unit tests focus on the component being tested, without relying on databases, APIs, or other external systems. Mocks or stubs are used to simulate inputs.
  • Granularity: They test the tiniest parts of the code - no big-picture testing here.
  • Automation: Unit tests are automated and repeatable, giving quick feedback.
  • Speed: Because they’re testing small, isolated pieces, unit tests run fast - ideal for frequent checks, for example in your CI/CD pipeline.
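
To make the isolation point concrete, here is a minimal Python sketch. The names latest_purchase and stub_fetch_purchases are hypothetical, made up for illustration. The idea is that the function under test receives its data source as an argument, so the test can hand it a stub instead of a real database query.

```python
from datetime import date
from typing import Callable, Iterable, List, Optional


def latest_purchase(
    fetch_purchases: Callable[[str], Iterable[date]], customer_id: str
) -> Optional[date]:
    """Return the customer's most recent purchase date, or None if there are none."""
    purchases = list(fetch_purchases(customer_id))
    return max(purchases) if purchases else None


def stub_fetch_purchases(customer_id: str) -> List[date]:
    """Stub standing in for a database query: always returns the same known data."""
    fake_table = {
        "c1": [date(2024, 1, 5), date(2024, 3, 20)],
        "c2": [],  # a customer with no purchases at all
    }
    return fake_table.get(customer_id, [])


def test_latest_purchase_uses_most_recent_date():
    assert latest_purchase(stub_fetch_purchases, "c1") == date(2024, 3, 20)


def test_latest_purchase_handles_customer_with_no_purchases():
    assert latest_purchase(stub_fetch_purchases, "c2") is None
```

Nothing here touches a warehouse or an API, so the tests run in milliseconds and give the same answer every time.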


What Does Unit Testing Look Like in Data?

In data, unit tests focus on business logic and transformations. A “unit” could be:

  • A specific SQL transformation
  • A piece of business logic, like “if a customer has no purchases in 90 days, mark them as inactive.”

But, and this is important, unit tests don’t rely on your real data. Instead, they use static, controlled input data that you manufacture specifically to test edge cases, boundary conditions, and expected outputs.
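
As a rough sketch of what that looks like, here is the 90-day rule written as a small Python function, tested against rows I made up for the purpose. The function name, and the decisions to treat a customer with no purchases at all as inactive and to count exactly 90 days as inactive, are my assumptions for the example. In a real project those are details to agree with your stakeholders.

```python
from datetime import date
from typing import Optional

INACTIVITY_DAYS = 90  # threshold taken from the business rule


def customer_status(last_purchase: Optional[date], as_of: date) -> str:
    """Return 'inactive' when there has been no purchase in the last 90 days."""
    if last_purchase is None:
        # Assumption: a customer who has never purchased counts as inactive.
        return "inactive"
    days_since = (as_of - last_purchase).days
    # Assumption: exactly 90 days without a purchase already counts as inactive.
    return "inactive" if days_since >= INACTIVITY_DAYS else "active"


def test_status_with_manufactured_rows():
    as_of = date(2024, 12, 31)  # fixed date, so the test never depends on "today"
    rows = {
        "recent_buyer": date(2024, 12, 1),  # 30 days before as_of
        "lapsed_buyer": date(2024, 6, 1),   # well over 90 days before as_of
        "never_bought": None,
    }
    result = {name: customer_status(last, as_of) for name, last in rows.items()}
    assert result == {
        "recent_buyer": "active",
        "lapsed_buyer": "inactive",
        "never_bought": "inactive",
    }
```

Note the fixed as_of date. If the test used the current date, its result could change from one day to the next without anyone touching the code.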


What Unit Testing is NOT

It’s easy to get confused about what unit testing actually is. So let me clear this up.

  1. Not Manual Test Scripts: Writing a query to check row counts or totals? That’s a manual test, not a unit test.
  2. Not High-Level Validation: Validating that your data pipeline produces correct totals or has no nulls is data validation, not unit testing.
  3. Not Dependent on Live Data: If your test relies on whatever data happens to be in the source system right now, it’s not a unit test. Unit tests use static, controlled inputs.
  4. Not End-to-End Pipeline Tests: Testing an entire pipeline for correctness is an integration test, not a unit test.
  5. Not Ad-Hoc Checks: Queries you write “just to see” if the data looks right? Not unit tests. Unit tests are structured, repeatable, and automated.

That's not to say the rest of these aren't useful. Of course they are. I'm just asking you all to stop calling them unit tests!


Why Unit Testing Matters: Future-Proofing Your Transformations

In my view, unit testing really shines in its future-proofing capacity.

When you write transformations, you’re often coding for scenarios that don’t exist in the current data yet, but that you have discussed and agreed with your stakeholders.

Without a unit test, someone could update or refactor your code and miss the edge case entirely. With a unit test, you:

  • Simulate the relevant data.
  • Confirm the logic works as intended.
  • Protect this rule from future changes.

Unit tests act as guardrails for your code. They make sure the edge cases you’ve accounted for today don’t quietly break when someone else makes changes down the track.


Unit Testing in Action: A Real-Life Scenario

Imagine you’re building a rule to mark customers as “inactive” after 90 days without a purchase.

Your current data doesn’t have any 90-day gaps, so how do you know the rule works?

With unit testing:

  • You create a small dataset with a known number of 90-day gaps.
  • You write a test to validate that the “inactive” logic applies correctly and returns exactly the number of inactive customers you expect. Not one more, not one less. In unit testing, this check is called an assertion.
  • You know the rule works, now and in the future.

If someone changes the logic later, intentionally or not, your unit test will catch it.
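
Here is one way that scenario could look in Python. Treat it as a sketch: the dataset, the helper names and the threshold_days parameter are all invented for illustration, and I have assumed that exactly 90 days without a purchase counts as inactive.

```python
from datetime import date, timedelta

AS_OF = date(2024, 12, 31)  # fixed reference date for the test


def make_customer(customer_id, days_since_purchase):
    """Build one manufactured row with a known gap since the last purchase."""
    return {
        "customer_id": customer_id,
        "last_purchase": AS_OF - timedelta(days=days_since_purchase),
    }


# Five manufactured customers; exactly two have gaps of 90 days or more.
CUSTOMERS = [
    make_customer("a", 10),
    make_customer("b", 60),
    make_customer("c", 89),   # one day short of the threshold
    make_customer("d", 90),   # exactly on the threshold
    make_customer("e", 200),
]


def inactive_customer_ids(customers, as_of, threshold_days=90):
    """Return the ids of customers whose last purchase is at least threshold_days old."""
    return [
        c["customer_id"]
        for c in customers
        if (as_of - c["last_purchase"]).days >= threshold_days
    ]


def test_exactly_two_customers_are_inactive():
    result = inactive_customer_ids(CUSTOMERS, AS_OF)
    assert len(result) == 2          # not one more, not one less
    assert result == ["d", "e"]      # and they are the ones we expect
```

If someone later changed the threshold to 60 days, the same dataset would return four customers instead of two, and the assertion would fail straight away.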


How to Implement Unit Tests

Implementation in practice will depend on which tool you are using. Most tools that have kept pace with the move to DataOps have some unit testing functionality.

dbt added native unit tests earlier in the year (in dbt Core v1.8); see their docs.

In Python, a commonly used package is pytest. It needs to be installed separately (for example with pip install pytest), and it is more comprehensive than unittest, which is part of the standard library.
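
To give a feel for the difference, here is the same boundary check written both ways. This is only a sketch: is_inactive is a hypothetical function, and counting exactly 90 days as inactive is my assumption. The pytest style needs no import, because pytest discovers plain functions whose names start with test_ in files named test_*.py or *_test.py. The unittest style wraps the tests in a TestCase class.

```python
import unittest


def is_inactive(days_since_last_purchase: int) -> bool:
    """Hypothetical rule: inactive once 90 or more days have passed without a purchase."""
    return days_since_last_purchase >= 90


# pytest style: plain functions and plain assert statements.
def test_day_89_is_still_active():
    assert is_inactive(89) is False


def test_day_90_is_inactive():
    assert is_inactive(90) is True


# unittest style: the same checks inside a TestCase class.
class InactiveRuleTest(unittest.TestCase):
    def test_day_89_is_still_active(self):
        self.assertFalse(is_inactive(89))

    def test_day_90_is_inactive(self):
        self.assertTrue(is_inactive(90))
```

Saved as test_inactive.py, this runs with either the pytest command or python -m unittest. pytest will also pick up the TestCase class, so you don't have to rewrite existing unittest tests to move across.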


Unit testing isn’t about checking today’s data, or that you didn't drop rows when moving data around. It’s about building confidence that your transformations will hold up under any condition, even scenarios you haven’t seen yet.

If you’re not writing unit tests for your models and business logic yet, it’s worth starting. They’re fast to run, simple to implement, and they save you a lot of headaches in the long run.

I’d love to hear from you. Do you write unit tests in your data engineering or analytics engineering? What’s your approach to testing edge cases and business logic? If this is new to you and you’re keen to get started, reach out and let’s chat!


Lidio Santos

PhD Student | Software Engineering | Unit Testing | Governance IT | Risk Management

2 months ago

Hi Meagan, congratulations on your approach to unit testing. I'm very interested in this topic, and in my academic research I usually use PHPUnit. I've sometimes found myself treating tests that touch external dependencies as unit tests instead of integration tests. Your approach clearly highlights the key characteristics of unit tests. Here in Brazil, I participate in a community called PHPRio, coordinated by Vitor Mattos, Daiane Alves, and Lucas Azevedo. We meet every month to discuss software development, with a focus on the PHP language (https://github.com/phprio)

Ethan Hawkins

SQL || Java || Python || C || Haskell || Rust || TypeScript

3 months ago

This, as well as the data tests, is my favourite thing about dbt right now. It's awesome that it is bringing the level of rigour of normal software development to data modelling work, especially with CI/CD. It's also just intrinsically satisfying to prove your code this way. Super underrated feature and a good read!


More articles by Meagan Palmer

  • 2025 #01 - Referential integrity in your kimball data model
    Welcome back to Analytics Engineering Today for 2025. This edition is a direct response to many LinkedIn posts and…

  • 2024 #5 - SCD2 != dbt snapshot
    In this edition of Analytics Engineering Today, we are going to talk about dimension tables, snapshots and the…

  • 2024 #4 - Microbatching: a deeper look
    (this article was updated after initial publishing to better reflect the timezone complexities and add a 'few days…

  • dbtCloud Sept/Oct 2024 - Release Notes
    There have been a lot of announcements and releases out of dbt in the past couple of months. There is a clear drive to…

  • 2024 #3 - Using 'defer' in dbt to save time and money
    In today's edition, I'm going to dive into a dbt feature that I feel is underutilised. Deferral.

  • dbtCloud - July/Aug 2024 Release Notes
    Today I'll be looking at some key features that have recently been released into dbtCloud and the actions you might…

  • 2024 #2 - managing the grain of your data
    Today's post is going to be about grain. The aim of this newsletter is to share stories and tips from things I have…

  • dbtCloud - June 2024 Release Notes
    In today's edition, we will have a look at the latest from dbt Labs. Last week the release notes for dbtCloud June 2024…

  • 2024 #1 - dbt docs to comments, test status lenses and Case Sensitivity
    I'm Meagan, your guide through the world of analytics engineering. In this newsletter I’ll be sharing insights gained…

  • Welcome to Analytics Engineering Today
    Hello and welcome to the first issue of Analytics Engineering Today. I'm Meagan, a Data Consultant based in Sydney…
