Maximizing BigQuery Efficiency: Using External Tables and Dataform for Cloud File Ingestion
Image partially generated with MS Co-Pilot GenAI (Designer Powered by DALL·E 3)


In the realm of data, BigQuery (BQ) stands out as a cornerstone service within the Google Cloud Platform. Evolving beyond traditional querying, BQ now offers a diverse range of features, from advanced data visualization to predictive analytics.

However, amidst these advancements, the foundational step of data loading remains paramount. Before any analysis can commence, data must be efficiently ingested into BigQuery. Establishing robust data pipelines is essential, ensuring data quality, integrity, and security.

Beyond the various services capable of loading data into BigQuery, whether Google's native solutions (e.g. Dataflow, Dataprep, Cloud Data Fusion) or the BigQuery APIs, Google keeps simplifying the process of loading raw data stored as files in cloud storage services. This includes seamless integration with Google Cloud Storage, allowing data to be ingested into, or viewed directly from, BigQuery. This has been available for years, but Google has recently added external table support for other cloud providers and formats (e.g. Amazon S3, Azure Blob Storage, Apache Iceberg). Furthermore, it is also facilitating the orchestration of pipelines through tools like Dataform, all powered by the robust BigQuery engine at its core. This convergence of services streamlines the data loading process, enhancing efficiency and enabling organizations to harness the full potential of their data with ease.


What is a BQ External table and why is it useful?

An external table in BigQuery is a BQ resource that allows you to query structured data stored outside of BigQuery itself. Instead of physically loading the data into BigQuery, you can create a table that references data stored in other data stores (such as Google Cloud Storage).

This feature has been available for years and was initially designed to support GCP services. The supported data stores were the following:

  • Cloud Storage: Query data stored in Cloud Storage in formats like CSV, JSON, Avro, ORC, and Parquet.
  • Bigtable: Query data stored in Google’s sparsely populated NoSQL database.
  • Google Drive: Query data stored in Google Drive.

Over time, Google has expanded BigQuery's external table capability to include support for other cloud providers and open-source storage formats, enhancing interoperability and versatility. Users can now create external tables that reference data stored in services outside the GCP ecosystem, such as:

  • Amazon S3: Query data stored in Amazon's Simple Storage Service (S3).
  • Azure Blob Storage: Query data stored in Microsoft Azure's Blob Storage.
  • Apache Iceberg: Query data stored using Apache Iceberg, an open table format for massive analytic datasets.

Moreover, Google has introduced BigQuery Omni, allowing users to query data across multiple cloud providers seamlessly. This service extends BigQuery's reach beyond GCP, enabling organizations to perform analytics on data stored in external sources across different clouds. Additionally, BigLake, Google's unified data lake storage solution, further integrates various data sources for efficient querying and analysis within BigQuery.

These enhancements empower organizations operating in multi-cloud environments or using diverse storage solutions, facilitating direct data analysis and reporting without the complexities of data migration.
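
As an illustration of the multi-cloud option, the sketch below shows what an external table over Amazon S3 can look like via BigQuery Omni. It is a minimal sketch with hypothetical project, dataset, connection, and bucket names, and it assumes a BigQuery connection to the relevant AWS region already exists and that the dataset resides in the matching Omni (AWS) location:

CREATE EXTERNAL TABLE `my_project.my_dataset.s3_transactions`
WITH CONNECTION `aws-us-east-1.my_aws_connection`
OPTIONS (
  format = 'CSV',
  uris = ['s3://my-external-bucket/transactions/*.csv']
);

Once defined, the table is queried with standard GoogleSQL, exactly like the GCS-backed external table used in the rest of this article.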


From data loading to data cleansing and refinement: Dataform

Leveraging Dataform's capabilities, organizations can access data stored across multicloud environments and various file formats, seamlessly creating native tables or views directly in BigQuery. This integration simplifies the process of data cleansing, refinement, and preparation, enabling the creation of consolidated resources within BigQuery for streamlined analytics and reporting.

Dataform's workflow management features enhance efficiency by automating tasks such as data validation, transformation, and documentation, ensuring that data pipelines are robust and reliable.

By leveraging these tools, organizations can effectively utilize data from multicloud sources to derive actionable insights and drive informed decision-making.

A real example

Consider an external data provider that has been granted write access to a specific Google Cloud Storage (GCS) bucket, created specifically for their use. This provider will upload data as files, such as CSVs, to this bucket.

The data provider follows a specified file path and file naming convention for data delivery. Each day, a script runs on the provider's side to extract transaction data from its system, maintaining a consistent structure. The script saves this data in a CSV file and uploads it to the GCS bucket using the correct naming format.

Currently, the data flows regularly into the GCS bucket but has not yet been imported into BigQuery (BQ). The goal is to make this data available in BQ in a well-managed manner, enabling its use for various purposes such as dashboarding and analytics.

GCS Bucket where CSV files are delivered following file path and filename established convention


The following sub-sections present and delve into the possible steps required to achieve a comprehensive data preparation process.


External Table - Data Exploration and reference

An external table in BigQuery allows you to query data stored outside of BigQuery, such as in Google Cloud Storage (GCS), without loading the data into BigQuery. To maximize the efficiency and utility of external tables, certain preconditions and best practices should be observed.

Within this specific use case, CSV files are loaded into GCS following some established conventions to facilitate the ingestion of the data. These conventions add significant value in the follow-up data preparation framework.

Key characteristics that facilitate data loading into BigQuery (BQ) from GCS include:

  1. Common GCS Radix Path: Files with the same record structure and topic representation are delivered to specific GCS file paths. For example, all files related to a specific topic (e.g., transactions) are loaded within gs://my_bucket/my_subpath/.... In the use case presented, the filepath follows a standard format: Bucket/source_id/pipeline_id/pipeline_version/year/file. All files under a pipeline_version folder must have the same record set, meaning the schema must be consistent. If a schema change is needed (e.g., additional fields), a new version and history re-extraction are required.
  2. Common GCS File Name Structure: Files with the same record structure and topic representation are delivered with a structured file name that conveys useful information usable for loading purposes (without needing to inspect the content of the file itself). For instance, files can be named to allow processing within a specific timeframe (e.g., 0001__transaction__dd__20240501.csv). In the presented use case, filenames follow a standard convention: pipeline_ID separator pipeline_name separator covered_timeframe_unit separator covered_timeframe.
  3. Common Schema: Files with the same record structure and topic representation are delivered with a consistent internal record structure. For instance, if a record has three fields, there must always be two separators. If a header is included, it should be consistent in every file, with the same values. If the first field is a date formatted as 2024-05-01, this format must be maintained consistently across all files without exceptions (e.g., avoiding a mix with 01/05/2024).
  4. Common Filetype and Formatting: Files with the same record structure and topic representation should be delivered in a consistent file type and encoding. For example, if one file is a CSV, all files should be CSVs with the same encoding. The same principle applies to other common file types used for data delivery.
  5. Header with Field Names: Every file follows the same pattern, meaning it either includes or excludes a header uniformly. If a header is present, it simplifies the process if the field names conform to BQ standards. Note that many field names from systems like BW may start with "0", which is not a valid starting character for field names in BigQuery. Additionally, other special characters (e.g., "(", "-", etc.) should be avoided. For more information, refer to the Lexical structure and syntax guidelines for BigQuery.
  6. Data Freshness Indicator: The external data provider adds a field in every CSV file that reports the extraction_data_date_time, a timestamp indicating when the data was extracted and written to the CSV file. This timestamp is the same for all records in the CSV file, providing a clear indicator of data freshness.

To explore the data through an External table within BQ, the table simply needs to be created. This can be done in different ways; here the manual approach is presented:


Manual creation of an External table in BQ - Pointing to a specific file path and by auto-detecting the schema

In the above image, you can see the configuration of the external table in BigQuery when creating it manually via the UI (the same can be achieved using SQL code). Please note the following important points:

  • The wildcard (*) used at the end of the file path specification in GCS will select all files with a common file path prefix. This is possible only because it has been ensured by design that all files following that path have the same structure.
  • The auto-detect schema feature can be used because it has been ensured by design that all files have the same schema structure and field formatting.
  • The auto-detect schema feature can be used because it has been ensured by design that all files have the same headers, maintain field order, and use valid BigQuery field names.

Additionally, there are more options available under Advanced settings, such as changing the standard field separator, allowing for field encapsulation, handling incomplete rows, error acceptance, and other settings. These can help manage various deviations from the ideal file configuration.
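
For completeness, the same external table can be created with a SQL DDL statement. The sketch below uses hypothetical project, dataset, and bucket names, and declares a reduced, all-STRING schema explicitly rather than relying on auto-detect (the reasoning behind the STRING choice is discussed further down); the OPTIONS shown map to the UI settings mentioned above:

CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.0001__customer_transactions__external_table_test` (
  DATE STRING,
  Sales_Qty STRING,
  extraction_data_date_time STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my_bucket/my_subpath/*'],  -- wildcard over the common radix path
  skip_leading_rows = 1,                   -- files are delivered with a header row
  field_delimiter = ',',                   -- the default, made explicit for clarity
  max_bad_records = 0                      -- fail fast on malformed rows
);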

Below you can see the result in BigQuery.

Details of the created External table in BigQuery


Auto-detected Schema for the created External table in BigQuery


In the above image, you can see that the table has been created in BigQuery. It does not consume storage, and the pointer to the GCS filepath is listed.

Additionally, the schema has been generated based on a subset of the records processed. Different field types are auto-detected and imposed in the schema by BigQuery itself. As a note, the schema could have been provided as input during the creation process instead of auto-detecting it.

Sometimes, it is advisable to define the schema manually to avoid errors. The external table is just a reference: at query time, if any real data within the files deviates from the defined schema, the query will fail.

For instance, consider a field detected as an INT, such as sales quantity. If, due to a bad manual input, one out of a million records in the CSV files contains a decimal value, like 1.1123454, this will cause the query to fail.

To handle such cases, one option is to declare the field as STRING and, in the subsequent data preparation step, use a SAFE_CAST to transform it into an INT. This approach allows for managing any bad data present in the file according to the company's data governance policies.

Below is an example run on the External table with sales quantity defined as STRING, since it was not possible to declare it as INT even though that would be the detected field format (as 99.999...% of the values would correctly fit in it).

Debug of a theoretically INT field not fitting in INT format - External table defined with the field as STRING
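
Such a debug boils down to a query along the following lines (a sketch, assuming a hypothetical external table name and the Sales_Qty field from the use case): it surfaces the populated values that do not survive a cast to INT64.

SELECT
  _FILE_NAME AS data_source__full_gcs_path,  -- pseudo-column: GCS file the record comes from
  Sales_Qty
FROM `my_project.my_dataset.0001__customer_transactions__external_table_test`
WHERE Sales_Qty IS NOT NULL
  AND SAFE_CAST(Sales_Qty AS INT64) IS NULL;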

Based on the confidence in the fields' formats present in the CSV files, the schema needs to be declared appropriately to avoid possible errors. When there is uncertainty, it is advisable to use STRING and apply post-processing conversion. This method ensures that any unexpected data formats, such as a numeric field containing a comma or wrongly formatted dates, do not cause the table to fail. Conversely, if the system providing the data has robust data validation, more specific field types can be confidently used.

At the end of this process, the external table is ready. It will always have the most up-to-date data, as it reads live from the CSV files, and can be used as a starting point for subsequent data preparation operations.
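
As a quick illustration of how the naming conventions pay off at this point, the sketch below (hypothetical table name, regular expression derived from the filename convention described earlier) exposes each record's file of origin and covered timeframe, and restricts a query to a single daily delivery without opening the files themselves:

SELECT
  *,
  _FILE_NAME AS data_source__full_gcs_path,
  REGEXP_EXTRACT(_FILE_NAME, r'__([0-9]{8})\.csv$') AS covered_timeframe
FROM `my_project.my_dataset.0001__customer_transactions__external_table_test`
WHERE _FILE_NAME LIKE '%__20240501.csv';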


Requirement definition for data preparation

The next step involves defining the target schema, which should result from processing the data to prepare it for future use. This schema should include:

  1. Field Renaming: If the headers in the given files are not suitable, rename the fields accordingly.
  2. Field Types: Specify new field types for any fields that can be safely cast or parsed.
  3. Field Descriptions: Provide descriptions for field names if needed.
  4. New Fields: Identify any new fields to be generated, such as enhancements or combinations of existing fields.

Additional information needed includes:

  • Target Table Destination: Specify where the processed data will be stored.
  • Update Methodology: Define how the table will be updated (e.g., table recreation, incremental updates).
  • Metadata: Include descriptions, labels, and any other relevant details for the target table.

Other meaningful requirements may be needed based on different scenarios.

This analysis and requirements gathering is essential to create the specifications for the developer who will code this step, starting from the created external table.

By establishing these requirements, the developer will have a clear understanding of how to transform and prepare the data for subsequent usage, ensuring consistency, accuracy, and usability in future analyses.

Data Preparation with Cleansing/Enhancing procedure

The requirements for transforming and preparing data can be efficiently managed using a Dataform SQLX file. In Dataform, the external table serves as the starting point for these operations. The SQLX file allows you to define various tasks such as field renaming, adding descriptions, and generating new fields.

These transformations are executed through different components of the SQLX structure. The config block can be used to set up table properties and configurations, while the SQL statement block is where you perform the actual data processing. For instance, to handle field renaming and type casting, you can write SQL statements within the block that align with your target schema requirements.

An important addition to any data processing step is tracking the source of each record. By leveraging the external table, you can create a field such as _FILE_NAME AS data_source__full_gcs_path. This field helps identify the origin of the data, making it easier to spot anomalies and maintain a clear data lineage.

Dataform’s flexibility supports both hardcoding the necessary steps and using reusable components. For example, you can define constant values and functions in constant.js or create specific functions within the includes folder for casting and safe parsing (even better, adopt NPM packages reusable across repositories). By defining these functions once, you can ensure consistent data processing practices and reduce management overhead.

Overall, Dataform provides a robust framework to implement and manage the transformations required for preparing your data, allowing for efficient and scalable data processing workflows.

config { 
    type: 'table',
    tags: ['...','...'],
    description: "....",
    bigquery: {
        partitionBy: '...',
        labels: {
            responsible: "...",
            usage: "...",
            source_system: "...",
        }

    },
    columns: {
        field: "description",
        field2: "description",        
        field3: "description",
        field4: "description",
        field5: "Description",
        data_source__full_gcs_path: "Full GCS path of the file where the record originates" 
    }
}

  /*
  ############  Description:
  - ...

  ############  Author:
  - ...
  - Date Created: [....]

  ############  Version History: (Optional)
  - Version 1.0: [24/05/2024]
  - Initial query creation.


  ############  Usage:
  - TBD

  ############  Labels (for usage tracking, debugging, cost monitoring, cost attribution)
  [Start] Internal Label [Start]
  Project | ...
  Priority | ...
  Team | ...
  ... | ...
  [End] Internal Label [End]

  */


SELECT
...,
SAFE.PARSE_DATETIME('%Y-%m-%d', DATE) AS DATE,
...,

SAFE_CAST(Sales_Qty AS INT64) AS Sales_Qty,
...,
_FILE_NAME AS data_source__full_gcs_path 
FROM ${ref('0001__customer_transactions__external_table_test')}        

In this proposed snippet, the information is hardcoded and no reference is made to constant.js or to general functions defined in JavaScript that might be used. It is a simple version that still provides insight. Here is a breakdown of the different sections:


  1. Configuration Block: The config block defines the properties and metadata for the BigQuery table, including partitioning, labels, and descriptions. This setup is crucial for organizing and managing the data table effectively.
  2. Comments Section: The comments section provides detailed documentation about the SQLX file, including descriptions, authorship, version history, and usage instructions. This documentation is essential for maintaining and understanding the code. Please refer to another article I published to understand the importance of creating these artificial internal BQ JOB labels and how to exploit them: Labeling Strategies for BigQuery Mastery: Enhancing Efficiency and Cost Management
  3. SQL Query Block: The SQL query performs data transformations such as parsing dates and casting fields to appropriate types, ensuring data consistency and handling potential errors. New fields can also be added and unneeded fields left out. The inclusion of _FILE_NAME AS data_source__full_gcs_path helps in creating a lineage for the data, making it easier to trace back to the original file in GCS. This feature is particularly useful for identifying anomalies and understanding the data’s origin.


Testing and publishing

BigQuery’s table cloning feature presents a powerful alternative for managing and publishing tables across different datasets. This capability allows for maintaining the development and data preparation processes within a specific dataset (available only to data preparation developers) and then cloning the prepared table to another dataset, which is accessible to data analysts, BI users, or other stakeholders who manage downstream operations on the cleaned/prepared data.

Advantages of Table Cloning

  1. Efficient Data Management:
     • Isolated Development: Keep all development and data preparation processes in a dedicated dataset, minimizing the risk of accidental changes to production data.
     • Seamless Publishing: Once data preparation is complete, clone the table to a separate dataset used for analysis and reporting. This separation ensures that end-users access only the finalized data.
  2. Cost-Effectiveness:
     • Inexpensive: Table cloning in BigQuery is cost-efficient as long as the cloned table is not modified. The clone shares the underlying storage with the original table, avoiding duplication costs.
     • Optimized Resource Utilization: Changes to the cloned table will only incur costs for the altered data, making it a scalable solution for managing large datasets.
  3. Version Control and Data Integrity:
     • Snapshots of Data: Cloning allows creating snapshots of the data at specific points in time, ensuring that users always have access to consistent and accurate data.
     • Audit and Lineage: Cloned tables help maintain clear data lineage and auditing, making it easier to track changes and ensure data integrity.
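
In plain SQL, publishing a prepared table as a zero-copy clone boils down to a single DDL statement (a minimal sketch with hypothetical project and dataset names):

CREATE OR REPLACE TABLE `my_project.published_dataset.0001__customer_transactions`
CLONE `my_project.dataprep_dataset.0001__customer_transactions`;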


Within Dataform, this result can be achieved with a post_operations block in the .sqlx file where the data preparation tasks are performed, also using a different table suffix or dataset destination depending on whether it is a test release or a prod release.

post_operations {
  ${when(dataform.projectConfig.vars.executionSetting === 'test',
    `CREATE OR REPLACE TABLE project.dataset.0001__customer_transactions__v_dataform_test
     CLONE ${self()};`,
    ''
  )}
  ${when(dataform.projectConfig.vars.executionSetting === 'prod',
    `CREATE OR REPLACE TABLE project.dataset.0001__customer_transactions
     CLONE ${self()};`,
    ''
  )}
}


A nicer way of doing this is to create a function that performs the clone on a specific subset of tables managed by Dataform. This can be achieved by declaring some constant information and a function that dynamically (i.e., for each table listed in a purpose-built clones.js; use any name that works for you) performs the cloning operation as a dependency of the data preparation.

const TARGET_DATASETS = [
  { prefix: 'my_target_dataset___', target: 'my_target_dataset' },
  { prefix: '...', target: '...' },
  { prefix: '...', target: '...' }
]

const TABLES = [
    { name: "my_target_dataset___0001__customer_transactions",            author: 'my_ID'},
    { name: "my_target_dataset___0001__stores_....",        author: 'my_ID'},

]

module.exports = { TABLES, TARGET_DATASETS }        

And then the related function:

const tableNames = clones.TABLES


tableNames.forEach(({name,author}) => {

    const dataset = clones.TARGET_DATASETS.find((dataset)=> name.indexOf(dataset.prefix) === 0)

    if (dataset && (dataform.projectConfig.vars.executionSetting === 'prod' || dataform.projectConfig.vars.executionSetting === 'test')) {

      const projectId = 'my_Project'
      const dataset_name = dataset.target
      const table_name = name.replace(dataset.prefix, '')

      operate('Clone_Operation_' + name ).queries(ctx => `

        /*
             ############  Description:

             - This script takes care of cloning tables based on their final name within the Dataform PROD environment. It loops over them and, given certain criteria, clones these tables into the right dataset for usage (based on prefix)


             ############  Author: 
             - ${author} (Contact Information) 
             - Date Created: [...]


             ############  Labels (for usage tracking, debugging, cost monitoring, cost attribution)
             [Start] Internal Label [Start] 
             Project | ...
             Priority | High
             Team | ...
             User | ${author}
             Functionality | Cloning 
             [End] Internal Label [End]

        */


        CREATE OR REPLACE TABLE ${projectId}.${dataset_name}.${table_name}${dataform.projectConfig.vars.executionSetting === 'test' ? '__v_dataform_test' : ''}
        CLONE ${ctx.ref(name)}; `);
    } else if (dataform.projectConfig.vars.executionSetting === 'dev') {
        
      const projectId = 'my_Project'
      const dataset_name = dataset.target
      const table_name = name.replace(dataset.prefix, '')
        
        operate('dummy_clone_' + name).queries(ctx => `
            
            SELECT "this is a dummy operation for dev env to replace the colne operation done in test/prod. When in prod it will be cloned to ${projectId}.${dataset_name}.${table_name}. When in test it will be cloned to ${projectId}.${dataset_name}.${table_name}__v_dataform_test" FROM ${ctx.ref(name)}; 
        `);
    }
});        

The result of this script will be visible in the compiled graph in Dataform: a "Clone_Operation_..." action will appear for every table listed in the clones.js file. This requires writing the code only once and simply updating the clones.js list of tables to which the operation should apply.

E2E flow represented in the Compiled Graph (Dataform)


Data preparation triggering

For a lightweight approach to automating the data processing workflow with Dataform whenever a new file is loaded to a specific bucket, you can leverage Google Cloud Functions in conjunction with the Dataform API. Here’s how this can be set up:

  1. Triggering the Cloud Function: Configure a Cloud Function to listen for events in the Google Cloud Storage (GCS) bucket. This can be achieved by setting up an event notification that triggers the Cloud Function whenever a new file is uploaded to the bucket. Cloud Functions are serverless and respond to changes in GCS by automatically executing your code.
  2. Processing the Trigger: When the Cloud Function is triggered by a new file upload event, it processes the event information to determine which file has been added. This can include extracting the file name, path, or other relevant metadata.
  3. Calling the Dataform API: The Cloud Function then makes an HTTP request to the Dataform API to initiate a data processing workflow. This API request can trigger specific Dataform workflows or actions, such as running a data transformation pipeline. You can pass parameters or specify configurations needed for the Dataform job.
  4. Dataform Workflow Execution: Once the Dataform API receives the request, it will execute the defined SQLX workflows to process the newly added file. This involves running SQL transformations, updating tables, and applying any specified data processing steps.
  5. Monitoring and Feedback: You can also set up monitoring and logging within the Cloud Function and Dataform to track the status of the data processing tasks. This ensures that any issues or failures are promptly identified and addressed.


Trigger configuration within the Cloud Function


The Cloud Function then needs the code to execute when the trigger fires; this is basically an API call. Dataform workflows can be initiated via an API call; please check the related documentation here: Dataform API | Google Cloud

One way to do so is to create in Dataform a workflow configuration for the data preparation to be executed, similar to the one presented in the image below:

Workflow Configuration in Dataform (example)


The parameters that need to be defined to create a workflow configuration are the following:

  • configuration_id
  • release configuration
  • tags (or other selectors that define the set of .sqlx files to be executed/orchestrated)

Once these parameters have been defined, the API call to the related endpoint can be made from the Cloud Function using the coding language of your choice.

Endpoint:

https://dataform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{LOCATION}/repositories/{REPO}/workflowInvocations

Payload:

{"workflowConfig":"projects/PROJECT_ID/locations/REGION/repositories/REPO/workflowConfigs/configuration_id" }


By integrating Google Cloud Functions with Dataform via the API, you can create an automated data processing pipeline that responds in real-time to new files added to a GCS bucket. This approach enhances operational efficiency by automating the triggering of Dataform workflows and ensures timely processing of incoming data. For more details on setting up Cloud Functions and interacting with Dataform via API, refer to the Google Cloud Functions documentation and the Dataform API documentation.

Based on different needs, and when a more structured approach is required, different options are available to orchestrate operations within GCP, such as Cloud Workflows or Cloud Composer (Airflow).


Conclusions

Combining the robust services and functionalities offered by Google Cloud Platform (GCP) creates a highly efficient and scalable data processing workflow. File delivery to Google Cloud Storage (GCS) keeps things simple for the data provider. By leveraging external tables in BigQuery, you can access this data without an initial load step, maintaining a seamless and up-to-date data source. Dataform streamlines data preparation within BigQuery, providing a structured and manageable way to transform raw data into clean, standardized data ready to deliver valuable insights. Cloud Functions, triggered by new file uploads to GCS and making API calls to Dataform, automate the entire process, ensuring that data preparation workflows are executed promptly and accurately. Cloud Workflows or Cloud Composer can be the right candidate for overall orchestration if needed.

This methodology scales with your data volume and complexity with little effort, as GCP's serverless architecture handles the underlying infrastructure. Moreover, table cloning in BigQuery allows for efficient and cost-effective data publishing, enabling data analysts and BI users to access the latest data without incurring significant storage costs. By maintaining development operations in a dedicated dataset and cloning tables to production environments, organizations can ensure data integrity and manage costs effectively.

Overall, this integrated GCP solution offers a powerful combination of reliability, scalability, and cost-efficiency, making it an ideal choice for many use cases spanning data ingestion and data management/refinement to enable analytics needs.

References and Further Reading


  1. Google Cloud Documentation - BigQuery External Data Sources: This official documentation provides an overview of how to create and manage external tables in BigQuery, including details on supported data formats and sources.
  2. Google Cloud Blog - Announcing General Availability of BigQuery Omni: This blog post details the general availability of BigQuery Omni and its capabilities, a service that allows querying data across multiple cloud providers.
  3. Google Cloud Documentation - BigQuery Omni: This documentation explains how to use BigQuery Omni to query data stored in other cloud providers like Amazon S3 and Azure Blob Storage.
  4. Google Cloud Next Presentation - BigQuery Omni: Data Analytics Across Clouds: A presentation from Google Cloud Next that covers the capabilities of BigQuery Omni, including its support for querying data in external cloud storage.

