Dataform Assertion Mastery: Smart Data Quality Monitoring in GCP BigQuery
Image partially generated with MS Co-Pilot GenAI (Image Creator from Designer)



Introduction

In the digital age, data has become the lifeblood of organizations across industries, driving decision-making, innovation, and competitive advantage. However, the value derived from data is heavily contingent on its quality. Poor-quality data can lead to erroneous conclusions, flawed insights, and ultimately, misguided business decisions. Therefore, ensuring data quality is paramount for organizations seeking to harness the full potential of their data assets.

Early in the data lifecycle, during data ingestion and preparation stages, proactive measures must be taken to enhance data quality. These stages are critical as they lay the foundation for downstream analytics and decision-making processes. By addressing data quality issues at this nascent stage, organizations can mitigate risks, improve the accuracy of insights, and unlock the true value of their data.

Enhancing data quality begins with establishing robust data governance frameworks and implementing best practices for data management. This involves defining data standards, policies, and procedures to ensure consistency, completeness, accuracy, and reliability of data across the organization.

Dataform is a tool designed for data analysts and engineers to manage data transformation workflows in a code-centric manner. It allows users to define, schedule, and orchestrate data transformations using SQL and JavaScript.

In Dataform, ensuring data quality is central to its approach to data transformation and management. Dataform provides a robust framework for addressing data quality concerns throughout the data lifecycle. At the heart of Dataform's data quality strategy lies the concept of assertions, which play a pivotal role in validating the integrity and correctness of data.


Compiled graph for a Dataform pipeline (source, transformations, assertions)


Assertion functionality

Dataform's assertion functionality allows you to verify the quality and correctness of your data transformations by defining validation rules directly within your Dataform workflow. These rules are typically expressed as SQL queries - which gives a lot of flexibility - and are executed automatically as part of your workflow.

Assertions are SQL queries that check the validity of your data according to specific criteria. You can define assertions in your Dataform project using the assert functionality within your workflow. For example, you might assert that certain columns cannot contain NULL values, that numeric values fall within a certain range, or that data conforms to a particular schema.

This can be done using two available options:

  • Create built-in assertions
  • Create manual assertions with SQLX

Depending on the configuration, an assertion can run before or after a transformation (checking the input or the output data stream) and can either block the follow-up steps or let them proceed.

Create built-in assertions

Built-in assertions are probably the most common ones and are defined within the config block of a Dataform .sqlx file.

These are predefined data quality checks that you can add to your table definitions. They help to ensure that your data meets specific criteria or rules.

The config block is part of your Dataform table definition. It contains metadata and configuration settings for the table. Assertions are specified within this block.

Examples of Built-in Assertions:

  1. uniqueKey: Asserts that the specified column (or combination of columns) uniquely identifies each row, so that no duplicates exist for that key.
  2. nonNull: Asserts that specified columns are not null across all rows. For example, if you have a user_id, customer_id, and email column, you can ensure that none of these values are null.
  3. rowConditions: Allows you to define custom logic that must be true for each individual row. For instance, you might check that signup_date is either null or greater than 2019-01-01.

config {
  type: "table",
  assertions: {
    uniqueKey: ["user_id"],
    nonNull: ["user_id", "customer_id", "email"],
    rowConditions: [
      'signup_date is null or signup_date > "2019-01-01"',
      'email like "%@%.%"'
    ]
  }
}
SELECT ...        

Create manual assertions with SQLX

Manual assertions are useful for advanced scenarios or when dealing with datasets not directly created by Dataform (where no config {} block is available in which to set the assertion).

When built-in assertions don’t cover your specific use case, you can create custom assertions. These assertions are written in a separate SQLX file.

Use Cases for Manual Assertions:

  • External Data Sources: If your data comes from external sources (not generated by Dataform), manual assertions allow you to validate it.
  • Complex Rules: For more complex data quality checks beyond what built-in assertions offer. This also covers custom business logic, when you need to enforce specific business rules unique to your organization.

config { type: "assertion" }

SELECT
  *
FROM
  ${ref("sometable")}
WHERE
  a IS NULL
  OR b IS NULL
  OR c IS NULL        

You can also establish connections between assertions that verify information in other tables before or after dependent transformations are executed. For example, you could verify in the primary metadata table whether a specific table has been created (especially if you anticipate a daily drop/create operation). This means the workflow would only commence if the updated data with drop/create has been delivered. Alternatively, this check could be conducted afterward to serve as an alert.
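For instance, a downstream transformation can declare an explicit dependency on a manual assertion so that it only runs if that assertion succeeds. The sketch below assumes the assertion is defined in a separate file named source_table_is_fresh.sqlx; all names are illustrative.

config {
  type: "table",
  dependencies: ["source_table_is_fresh"]
}

-- This transformation is skipped if the "source_table_is_fresh" assertion fails
SELECT
  *
FROM
  ${ref("sometable")}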



Possible outcomes of an assertion run

Each assertion is materialized as a view in a dedicated assertions dataset of your BigQuery project. This view contains the results of the assertion query: if an assertion fails, the rows violating the rules appear there.
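For example, the offending rows can be inspected directly in BigQuery. The query below is only a sketch: dataform_assertions is a common default name for the assertions dataset and the view name matches the assertion name, but both depend on your project settings.

-- Inspect the rows that violated a hypothetical assertion named "sometable_not_null"
SELECT
  *
FROM
  `my-project.dataform_assertions.sometable_not_null`
LIMIT 100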

The possible outcomes of Dataform assertions:

Success:

  • When an assertion succeeds, it means that no rows violate the specified rules.
  • Your data meets the expected criteria, and the assertion passes.
  • This outcome indicates good data quality.

Failure:

  • If an assertion fails, it means that some rows violate the rules.
  • These violations indicate data quality issues.
  • You need to monitor assertion results manually by accessing the Dataform UI, or set up additional tools/development to facilitate monitoring and alerting.

Dataform UI - Assertion failure detail


How do you monitor data assertion results?

In the Google Cloud Platform (GCP) ecosystem, Dataform operates seamlessly with BigQuery (BQ), Google's fully managed, serverless data warehouse. Within this context, assertions in Dataform are indeed SQL statements that are executed on the BigQuery engine.

Here's how assertions function within the GCP environment with Dataform:

  1. Integration with BigQuery: Dataform pipelines are typically configured to interact with BigQuery as the underlying data storage and processing engine. Users define their data transformation workflows, including SQL-based models and assertions, within Dataform projects.
  2. Assertion Definition: Assertions in Dataform are expressed as SQL queries that define validation conditions for the data residing in BigQuery tables. These queries can range from simple checks, such as verifying the presence of required fields, to more complex validations, such as ensuring data consistency across multiple tables or adherence to specific business rules.
  3. Execution on BigQuery: When a Dataform pipeline is executed, either manually or on a scheduled basis, the assertions defined within the project are translated into SQL queries and executed directly on the BigQuery engine. This leverages the scalability and processing power of BigQuery to perform data quality checks at scale.
  4. Result Reporting: The results of assertion checks are captured within Dataform and can be accessed through its web interface or command-line interface (CLI). Users can view the status of assertion checks, including any failures or errors encountered during execution. Additionally, Dataform provides detailed logs and error messages to facilitate troubleshooting and resolution of data quality issues.
  5. Continuous Monitoring and Alerting: Dataform enables continuous monitoring of assertion results, allowing users to track the status of data quality checks over time. In the event of assertion failures or anomalies, users can set up alerts and notifications to proactively address data quality issues and maintain the integrity of their data pipelines.


How to implement monitoring and alerting? (The official way)

Google recommends monitoring and alerting on Dataform SQLX runs (including Dataform assertions) through its Cloud Logging and log-based alert mechanisms.


Log Explorer - Filtering Dataform events and assertion related errors


Here's a summary of how Google facilitates these processes:

  1. Cloud Logging Integration: Google Cloud Logging seamlessly integrates with Dataform, allowing users to monitor and analyze logs generated by Dataform SQLX queries, including assertion checks. Dataform automatically logs information about query executions, including success, failure, and performance metrics, to Cloud Logging (see Log Explorer).
  2. Log-Based Alerts: Users can set up log-based alerts in Google Cloud Monitoring to receive notifications based on specific events or conditions detected in Dataform logs. For instance, users can create alerts that trigger notifications when assertion failures occur, indicating potential data quality issues (a sample log filter is sketched after this list). This proactive alerting mechanism enables users to promptly address issues and maintain data integrity.
  3. Customized Alerting Rules: Google Cloud Monitoring provides flexibility in configuring alerting rules based on user-defined criteria. Users can customize alerting thresholds, conditions, and notification channels according to their specific monitoring requirements. This allows for tailored alerting mechanisms that align with organizational preferences and priorities.
  4. Centralized Monitoring Dashboard: Google Cloud Monitoring offers a centralized dashboard for viewing and managing alerts, logs, and metrics across the entire GCP environment, including Dataform SQLX executions. This unified interface provides a comprehensive overview of data pipeline health and performance, facilitating efficient monitoring and troubleshooting.
  5. Integration with Dataform Assertions: By leveraging Cloud Logging and log-based alerts, users can monitor the execution of Dataform assertions in real-time. This enables proactive detection of data quality issues, such as assertion failures, and triggers timely notifications to relevant stakeholders. As a result, organizations can uphold data quality standards and ensure the reliability of their data pipelines.
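For example, a log-based alert can be driven by a filter along these lines; treat the resource type as an assumption to verify against the official Dataform logging documentation for your environment.

resource.type="dataform.googleapis.com/Repository"
severity=ERROR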


Log-based alert policy - based on the selected log query, users can be notified via the available channels (e.g. email)


In summary, Google promotes the use of Cloud Logging and log-based alerts as the official way to monitor and alert on Dataform SQLX executions, including Dataform assertions. These mechanisms provide real-time visibility into data pipeline operations, enable proactive detection of issues, and facilitate timely response to maintain data integrity and reliability.


A "quicker" way to collect assert run information

Google Cloud Platform (GCP) provides a default view called INFORMATION_SCHEMA.JOBS within BigQuery (BQ), which captures metadata about jobs that run on the BigQuery engine, including Dataform assertions. This default view allows users to access job-related information directly from BigQuery without the need to interact with Cloud Logging. Here's a description of the default view:

  1. Metadata Collection: The INFORMATION_SCHEMA.JOBS view aggregates metadata about jobs executed within BigQuery, including details such as job IDs, job types, start and end times, status, and resource usage metrics.
  2. Accessibility: Users can query the INFORMATION_SCHEMA.JOBS view like any other table or view within BigQuery. This provides a convenient and familiar interface for accessing job-related information directly from the BigQuery web UI, command-line tools, or client libraries without relying on external logging systems.
  3. Job Types: The default view captures information about various types of jobs executed in BigQuery, including query jobs, load jobs, extract jobs, copy jobs, and Dataform SQLX jobs (which encompass Dataform assertions). This allows users to monitor and analyze different types of job executions within a unified framework.
  4. Filtering and Analysis: Users can filter and analyze job-related information using SQL queries against the INFORMATION_SCHEMA.JOBS view. For instance, users can retrieve job statistics, identify long-running or failed jobs, track resource consumption trends, and monitor job execution patterns over time. Dataform assertions executed within BigQuery SQLX jobs are included in the metadata captured by the INFORMATION_SCHEMA.JOBS view. This means that users can query the view to retrieve information about assertion checks, including job IDs, execution times, statuses (success or failure), and other relevant attributes.
  5. Complementary to Cloud Logging: While the default view provides a convenient way to access job-related metadata within BigQuery, it complements rather than replaces Cloud Logging for comprehensive monitoring and alerting. Users can leverage both mechanisms in tandem to gain insights into job executions, track performance metrics, and detect anomalies effectively.

In summary, the INFORMATION_SCHEMA.JOBS view in BigQuery serves as a centralized repository for accessing metadata about jobs executed within the BigQuery engine, including Dataform assertions. By querying this default view, users can gain visibility into job execution history, monitor resource usage, and analyze job-related metrics directly within BigQuery, enhancing the monitoring and management of data transformation workflows.
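As a minimal sketch (the region qualifier, the one-day window, and the filter on a dataform_assertions dataset are assumptions to adapt to your environment), a query along these lines surfaces recent assertion runs:

SELECT
  creation_time,
  job_id,
  state,
  error_result.message AS error_message,     -- NULL when the job, and thus the assertion, succeeded
  destination_table.dataset_id AS target_dataset,
  destination_table.table_id AS assertion_name
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND (destination_table.dataset_id = "dataform_assertions"
    OR EXISTS (SELECT 1 FROM UNNEST(referenced_tables) t WHERE t.dataset_id = "dataform_assertions"))
ORDER BY
  creation_time DESC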

For further insights into leveraging the metadata contained in BigQuery's INFORMATION_SCHEMA.JOBS view for different purposes, feel free to explore my previous article: "BigQuery Metadata: A Hidden and Often Undervalued Treasure in Data Management".

What are the benefits of the "quicker approach" using just the official default BQ metadata tables?

Using the official default BigQuery metadata tables provides several benefits over relying solely on Cloud Logging for monitoring data transformations:

  1. Simplified Access: The default BigQuery metadata tables, such as INFORMATION_SCHEMA.JOBS, offer a straightforward and familiar interface for accessing job-related information directly within BigQuery. Users can query these tables using SQL statements without needing to interact with external logging systems or APIs.
  2. Integrated Workflow: By leveraging BigQuery's built-in metadata tables, monitoring becomes seamlessly integrated into the data transformation workflow. Users can access job-related metrics and status updates alongside their data processing tasks, streamlining the monitoring process and reducing context switching between different tools or interfaces.
  3. Reduced Latency: Accessing metadata directly from BigQuery's internal tables typically results in lower latency compared to querying data from external logging systems like Cloud Logging. This quicker access to job-related information enables faster insights and decision-making, particularly for real-time or near-real-time monitoring requirements.
  4. Cost-Efficiency: Querying the default BigQuery metadata tables may incur fewer costs compared to accessing log data stored in Cloud Logging, especially for large-scale or frequent monitoring tasks. Since metadata tables are part of the BigQuery service, there may be fewer data egress charges associated with accessing job-related information.
  5. Granular Control: BigQuery metadata tables provide granular control over the types of job-related information that users can retrieve. Users can customize SQL queries to filter, aggregate, and analyze metadata based on specific criteria, tailoring monitoring efforts to their unique requirements and preferences.
  6. Consistent Experience: Using BigQuery's metadata tables ensures a consistent monitoring experience for users within the BigQuery environment. Organizations can establish standardized monitoring practices and tooling around these built-in features, promoting consistency and alignment across teams and projects.
  7. Accessing more information and tailoring it to specific needs: Utilizing BigQuery's metadata provides access to a broader range of information compared to Log Explorer. Furthermore, by embedding information directly within the assertion view description/labels or within comments in the assertion query code (to be parsed out), additional fields can be added to the final table to reflect the status of assertion runs.

In summary, the quicker approach of utilizing the official default BigQuery metadata tables for monitoring offers simplicity, integration, reduced latency, cost-efficiency, granular control, and a consistent user experience. While Cloud Logging remains valuable for comprehensive monitoring and alerting needs, leveraging BigQuery's built-in metadata tables provides a pragmatic and efficient solution for monitoring data transformations directly within the BigQuery environment.
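As an illustration of point 7 above, metadata can be embedded directly in the assertion definition and later retrieved alongside the run information. The sketch below is hypothetical: the description format and the tag are arbitrary conventions, not Dataform requirements.

config {
  type: "assertion",
  description: "DQ-007 | owner: data-platform | severity: high | target: sometable",
  tags: ["data_quality"]
}

-- Comments like this one end up in the job's query text captured by INFORMATION_SCHEMA.JOBS
-- and can be parsed out; the description above is attached to the assertion view itself.
SELECT
  *
FROM
  ${ref("sometable")}
WHERE
  email NOT LIKE "%@%.%"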


Example of how to filter


One way to create a data foundation covering only assertion checks is to implement periodic SQL code (e.g. a Dataform pipeline, ideally incremental) that queries the metadata table with job information, isolates the assertions, and builds another, smaller table. This can streamline monitoring and facilitate the creation of a dedicated dashboard for data quality assurance. Here's how this process can be structured and the benefits it offers:

  1. Scheduled Job to Query Metadata: Set up a scheduled function or job to periodically query the BigQuery metadata tables, such as INFORMATION_SCHEMA.JOBS, to retrieve information about job executions, including Dataform assertions. This function can run at predefined intervals (e.g., daily, every 6 hours) to ensure regular updates.
  2. Isolate Assertions: Within the scheduled function, filter the retrieved job information to isolate assertions specifically. Since assertions are SQL queries executed as part of Dataform SQLX jobs, you can identify them based on their job type or other metadata attributes that distinguish them from other types of jobs.
  3. Build Smaller Table: Use the filtered assertion information to populate another smaller table dedicated to storing essential details about each assertion check. This table should include key information such as timestamp of execution, success or failure status, the plain query of the assertion, description of the assertion, and the target table on which the assertion has been performed.

The resulting table can be connected to a dashboard for user exploration. Establishing this in-depth view of the assertion checks requires minimal setup time and offers a comprehensive perspective on the performed data quality checks, which can also be further enriched, as sketched below.
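A minimal sketch of such a pipeline, written as a Dataform incremental table: the schema name, the region qualifier, and the filter on a dataform_assertions dataset are assumptions to adapt to your environment.

config {
  type: "incremental",
  schema: "monitoring",
  description: "History of Dataform assertion runs extracted from BigQuery job metadata"
}

SELECT
  creation_time,                                 -- timestamp of execution
  job_id,
  state,
  error_result.message AS error_message,         -- NULL when the assertion passed
  destination_table.table_id AS assertion_name,
  query                                          -- plain text of the assertion query
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  destination_table.dataset_id = "dataform_assertions"
  ${when(incremental(), `AND creation_time > (SELECT MAX(creation_time) FROM ${self()})`)}

Scheduled via a Dataform workflow configuration (e.g. daily), this keeps the monitoring table up to date and ready to feed a dashboard.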


Draft dashboard sample - based on the job/assertion table built from the default BQ metadata


Conclusion

In conclusion, as the importance of data quality checks continues to grow in the data lifecycle, ensuring accurate insights and informed decision-making becomes paramount. Dataform emerges as a robust solution, offering the capability to embed data quality checks, known as assertions, directly into data processing workflows.

It is possible to monitor data assertions within the GCP Dataform UI, but the UI is primarily designed for checking and fixing failures rather than for collecting comprehensive statistics and providing the clear overview achievable in a dashboard. Leveraging Cloud Logging and Cloud Monitoring is the officially recommended way to go further. However, a more efficient alternative lies in harnessing BigQuery's INFORMATION_SCHEMA.JOBS and extracting pertinent data about assertion runs, along with their descriptions and key metrics.

This approach boasts several advantages, notably eliminating the need to configure and use cloud logs, as the data is already stored in BigQuery. Additionally, it offers a familiar interface, requires no additional permissions, and facilitates seamless integration into existing workflows. By periodically extracting this subset of information, organizations can efficiently monitor data quality and empower stakeholders with a comprehensive dashboard for insightful analysis and decision-making.


References and Further Reading

To explore the realm of Dataform assertions further and the various ways to utilize the derived information, you may find the following resources valuable:


  1. GCP Test tables with assertions: The official documentation offers comprehensive insights into creating and utilizing test tables with assertions in Google Cloud Platform (GCP).
  2. View Cloud Logging for Dataform: The official GCP documentation provides guidance on viewing Cloud Logging for monitoring Dataform workflows and job executions.
  3. Configure alerts for failed workflow invocations: The official GCP documentation offers instructions on configuring alerts for detecting failed workflow invocations in Dataform using other GCP tools.
  4. GCP BigQuery - JOBS view: The official GCP documentation provides information on utilizing the JOBS view in BigQuery to access metadata about job executions, including Dataform assertions.
  5. Technical and Business data alerts using Dataform: Explore articles on Medium related to this topic, offering practical insights and examples.
  6. A Comprehensive Guide to Error Handling and Alerting in Dataform: Find in-depth guides on Medium that cover error handling and alerting strategies in Dataform, providing comprehensive approaches and best practices.

