Dataform Assertion Mastery: Smart Data Quality Monitoring in GCP BigQuery
Introduction
In the digital age, data has become the lifeblood of organizations across industries, driving decision-making, innovation, and competitive advantage. However, the value derived from data is heavily contingent on its quality. Poor-quality data can lead to erroneous conclusions, flawed insights, and ultimately, misguided business decisions. Therefore, ensuring data quality is paramount for organizations seeking to harness the full potential of their data assets.
Early in the data lifecycle, during data ingestion and preparation stages, proactive measures must be taken to enhance data quality. These stages are critical as they lay the foundation for downstream analytics and decision-making processes. By addressing data quality issues at this nascent stage, organizations can mitigate risks, improve the accuracy of insights, and unlock the true value of their data.
Enhancing data quality begins with establishing robust data governance frameworks and implementing best practices for data management. This involves defining data standards, policies, and procedures to ensure consistency, completeness, accuracy, and reliability of data across the organization.
Dataform is a tool designed for data analysts and engineers to manage data transformation workflows in a code-centric manner. It allows users to define, schedule, and orchestrate data transformations using SQL and JavaScript.
Ensuring data quality is central to Dataform's approach to data transformation and management: it provides a robust framework for addressing data quality concerns throughout the data lifecycle. At the heart of this strategy lies the concept of assertions, which play a pivotal role in validating the integrity and correctness of data.
Assertion functionality
Dataform's assertion functionality allows you to verify the quality and correctness of your data transformations by defining validation rules directly within your Dataform workflow. These rules are typically expressed as SQL queries, which gives you a lot of flexibility, and are executed automatically as part of your workflow.
Assertions are SQL queries that check the validity of your data according to specific criteria. You can define assertions in your Dataform project using the assert functionality within your workflow. For example, you might assert that certain columns cannot contain NULL values, that numeric values fall within a certain range, or that data conforms to a particular schema.
This can be done using two available options: built-in assertions declared in the config block of a table definition, or manual assertions written in dedicated SQLX files. Both are described below.
Depending on the configuration, an assertion can run before or after a transformation (checking the input or the output data stream) and can either block the follow-up steps or simply report problems without stopping them.
Create built-in assertions
Built-in assertions are probably the most common ones. They are defined within the config block of a Dataform .sqlx file.
These are predefined data quality checks that you can add to your table definitions. They help to ensure that your data meets specific criteria or rules.
The config block is part of your Dataform table definition. It contains metadata and configuration settings for the table. Assertions are specified within this block.
Examples of Built-in Assertions:
config {
  type: "table",
  assertions: {
    // Fails if two rows share the same user_id
    uniqueKey: ["user_id"],
    // Fails if any of these columns contains a NULL value
    nonNull: ["user_id", "customer_id", "email"],
    // Fails if any row violates one of these conditions
    rowConditions: [
      'signup_date is null or signup_date > "2019-01-01"',
      'email like "%@%.%"'
    ]
  }
}
SELECT ...
Create manual assertions with SQLX
Manual assertions are useful for advanced scenarios or when dealing with datasets not created directly by Dataform (where no config {} block is available in which to declare the assertion).
When built-in assertions don’t cover your specific use case, you can create custom assertions. These assertions are written in a separate SQLX file.
Use cases for manual assertions include datasets that Dataform does not create itself, checks too complex for the built-in options, and verifications that span multiple tables. The following example flags every row in which any of three columns is NULL:
config { type: "assertion" }

-- Any row returned by this query is reported as an assertion failure
SELECT
  *
FROM
  ${ref("sometable")}
WHERE
  a IS NULL
  OR b IS NULL
  OR c IS NULL
You can also connect assertions to other tables and have them run before or after dependent transformations. For example, you could check the primary metadata table to confirm that a specific table has been (re)created, which is especially useful if you anticipate a daily drop/create operation: the workflow would then only start once the refreshed data has been delivered. Alternatively, the same check could run afterwards and act as an alert.
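A minimal sketch of this pattern, assuming a hypothetical metadata table pipeline_metadata and a hypothetical target table daily_sales (adjust all names to your project): the manual assertion fails if no record of today's load exists, and the downstream transformation lists the assertion in its dependencies so that it only runs once the check has passed.

-- definitions/assert_daily_sales_loaded.sqlx (hypothetical file name)
config { type: "assertion" }

-- Returns a row (i.e. fails) when today's drop/create has not been recorded yet
SELECT missing
FROM (
  SELECT COUNT(*) = 0 AS missing
  FROM ${ref("pipeline_metadata")}
  WHERE table_name = "daily_sales"
    AND DATE(created_at) = CURRENT_DATE()
)
WHERE missing

-- definitions/daily_sales_report.sqlx (hypothetical file name)
config {
  type: "table",
  // the report is only built once the assertion above has passed
  dependencies: ["assert_daily_sales_loaded"]
}
SELECT ... FROM ${ref("daily_sales")}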
Possible outcomes of an assertion run
Each assertion creates a view in a dedicated assertions dataset in your BigQuery project (dataform_assertions by default). This view contains the result of the assertion query: if an assertion fails, the rows violating the rules appear there.
The possible outcomes of a Dataform assertion run:
Success: the assertion query returns no rows, and dependent steps continue as planned.
Failure: the assertion query returns one or more rows; those rows can be inspected in the assertion view (see the query sketch below) and, depending on how dependencies are configured, follow-up steps may be blocked.
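When an assertion fails, the offending rows can be inspected directly in BigQuery by querying the corresponding view. A minimal sketch, assuming the default dataform_assertions dataset and a hypothetical project and assertion name:

-- Inspect the rows that caused an assertion to fail
-- (project, dataset and view names are placeholders; use the names shown
-- in your own Dataform run results or in the BigQuery console)
SELECT *
FROM `my-project.dataform_assertions.sometable_not_null`
LIMIT 100;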
How do you monitor data assertion results?
In the Google Cloud Platform (GCP) ecosystem, Dataform operates seamlessly with BigQuery (BQ), Google's fully managed, serverless data warehouse. Within this context, assertions in Dataform are indeed SQL statements that are executed on the BigQuery engine.
Here's how assertions function within the GCP environment with Dataform: each assertion compiles to a SQL query that runs as a regular job on the BigQuery engine; its result is materialized as a view in the assertions dataset, and the job itself leaves a trace in both Cloud Logging and BigQuery's job metadata.
How to implement monitoring and alerting? (The official way)
Google's recommended approach to monitoring and alerting for Dataform SQLX runs (including Dataform assertions) relies on Cloud Logging and its log-based alerting mechanisms.
In short, Dataform workflow invocations and their results, including assertion failures, are recorded in Cloud Logging; log-based alerting policies in Cloud Monitoring can then be configured on these log entries to notify the team whenever a run or an assertion fails.
In summary, Google promotes the use of Cloud Logging and log-based alerts as the official way to monitor and alert on Dataform SQLX executions, including Dataform assertions. These mechanisms provide real-time visibility into data pipeline operations, enable proactive detection of issues, and facilitate timely response to maintain data integrity and reliability.
A "quicker" way to collect assertion run information
Google Cloud Platform (GCP) provides a default view called INFORMATION_SCHEMA.JOBS within BigQuery (BQ), which captures metadata about the jobs that run on the BigQuery engine, including Dataform assertions. This view allows users to access job-related information directly from BigQuery without the need to interact with Cloud Logging. Each row describes one job and includes, among other fields, the job ID, the user who ran it, the creation and end time, the executed query text, the final state, any error result, and resource-usage metrics such as bytes processed and slot milliseconds.
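A minimal sketch of such a query (the region qualifier, the look-back window, and the text filter on the default dataform_assertions dataset name are assumptions to adapt to your own setup):

SELECT
  creation_time,
  job_id,
  user_email,
  state,
  error_result.message AS error_message,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  -- crude filter: keep only jobs that touch the assertions dataset
  AND query LIKE '%dataform_assertions%'
ORDER BY
  creation_time DESC;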
In summary, the INFORMATION_SCHEMA.JOBS view in BigQuery serves as a centralized repository for accessing metadata about jobs executed within the BigQuery engine, including Dataform assertions. By querying this default view, users can gain visibility into job execution history, monitor resource usage, and analyze job-related metrics directly within BigQuery, enhancing the monitoring and management of data transformation workflows.
For further insights into leveraging the metadata contained in the INFORMATION_SCHEMA.JOBS view in BigQuery for different purposes, feel free to explore my previous article: "BigQuery Metadata: A Hidden and Often Undervalued Treasure in Data Management".
What are the benefits of the "quicker approach" using just the official default BQ metadata tables?
Using the official default BigQuery metadata tables provides several benefits over relying solely on Cloud Logging for monitoring data transformations, summarized below.
In summary, the quicker approach of utilizing the official default BigQuery metadata tables for monitoring offers simplicity, integration, reduced latency, cost-efficiency, granular control, and a consistent user experience. While Cloud Logging remains valuable for comprehensive monitoring and alerting needs, leveraging BigQuery's built-in metadata tables provides a pragmatic and efficient solution for monitoring data transformations directly within the BigQuery environment.
One way to build a dedicated base of assertion-check data is to implement a periodic SQL job (for example, a Dataform pipeline, ideally incremental) that queries the metadata table with job information, isolates the assertion runs, and writes them to a smaller table. This streamlines monitoring and facilitates the creation of a dedicated data quality dashboard; a sketch of such a pipeline follows.
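A minimal sketch of such a pipeline as an incremental Dataform table (the file name, the target table, and the text filter on the default dataform_assertions dataset name are assumptions to adapt):

-- definitions/assertion_run_history.sqlx (hypothetical file name)
config {
  type: "incremental",
  description: "History of Dataform assertion jobs extracted from BigQuery job metadata"
}

SELECT
  creation_time,
  job_id,
  user_email,
  state,
  error_result.message AS error_message,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  -- crude filter: keep only jobs that touch the assertions dataset
  query LIKE '%dataform_assertions%'
  -- on incremental runs, only pick up jobs newer than those already stored
  ${when(incremental(), `AND creation_time > (SELECT MAX(creation_time) FROM ${self()})`)}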
The resulting table can be connected to a dashboard for exploration. Establishing this in-depth view of assertion checks requires minimal setup time and offers a comprehensive perspective, which can also be further enriched, on the performed data quality checks.
Conclusion
In conclusion, as the importance of data quality checks continues to grow in the data lifecycle, ensuring accurate insights and informed decision-making becomes paramount. Dataform emerges as a robust solution, offering the capability to embed data quality checks, known as assertions, directly into data processing workflows.
While it is possible to monitor data assertions within the GCP Dataform UI, that interface is primarily designed for checking and fixing individual failures rather than for collecting statistics and giving the clear overview a dashboard can provide. Cloud Logging and Cloud Monitoring remain the officially promoted route. However, a more efficient alternative lies in harnessing BigQuery's INFORMATION_SCHEMA.JOBS and extracting the pertinent data from assertion runs, along with their descriptions and key metrics.
This approach boasts several advantages, notably eliminating the need to configure and use cloud logs, since the data is already stored in BigQuery. Additionally, it offers a familiar interface, requires no additional permissions, and fits seamlessly into existing workflows. By periodically extracting this subset of information, organizations can efficiently monitor data quality and empower stakeholders with a comprehensive dashboard for insightful analysis and decision-making.
References and Further Reading
To explore the realm of Dataform assertions further, and the various ways to utilize the derived information, you may find the following resources valuable: