Maximizing BigQuery Efficiency: Using External Tables and Dataform for Cloud File Ingestion
In the realm of data, BigQuery (BQ) stands out as a cornerstone service within the Google Cloud Platform. Evolving beyond traditional querying, BQ now offers a diverse range of features, from advanced data visualization to predictive analytics.
However, amidst these advancements, the foundational step of data loading remains paramount. Before any analysis can commence, data must be efficiently ingested into BigQuery. Establishing robust data pipelines is essential, ensuring data quality, integrity, and security.
Besides the various services capable of loading data into BigQuery, whether through Google's native solutions (e.g. Dataflow, Dataprep, Cloud Data Fusion) or by leveraging the BigQuery APIs, Google is simplifying the process of loading raw data stored as files in cloud services. This includes seamless integration with Google Cloud Storage, allowing for direct ingestion into, or querying from, BigQuery. This has been available for years, but Google has recently added external table support for other cloud providers and open formats (e.g. Amazon S3, Azure Blob Storage, Apache Iceberg), and it is also facilitating the orchestration of pipelines through tools like Dataform, all powered by the robust BigQuery engine at its core. This convergence of services streamlines the data loading process, enhancing efficiency and enabling organizations to harness the full potential of their data with ease.
What is a BQ external table and why is it useful?
An external table in BigQuery is a BQ resource that allows you to query structured data stored outside of BigQuery itself. Instead of physically loading the data into BigQuery, you can create a table that references data stored in other data stores (such as Google Cloud Storage).
This feature has been available for years and was initially designed around GCP services; the supported data stores were Google Cloud Storage, Cloud Bigtable, and Google Drive.
Over time, Google has expanded BigQuery's external table capability to include support for other cloud providers and open-source storage formats, enhancing interoperability and versatility. Users can now create external tables that reference data stored in services and formats outside the GCP ecosystem, such as Amazon S3, Azure Blob Storage, and open table formats like Apache Iceberg.
Moreover, Google has introduced BigQuery Omni, allowing users to query data across multiple cloud providers seamlessly. This service extends BigQuery's reach beyond GCP, enabling organizations to perform analytics on data stored in external sources across different clouds. Additionally, BigLake, Google's unified data lake storage solution, further integrates various data sources for efficient querying and analysis within BigQuery.
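As an illustrative sketch (the connection name, dataset, and bucket are hypothetical, and a BigQuery Omni connection to AWS must already exist in the dataset's location), an external table over Amazon S3 can be declared very similarly to a GCS-based one:

CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.s3_transactions`
WITH CONNECTION `aws-us-east-1.my_aws_connection`  -- pre-created BigQuery Omni connection to AWS
OPTIONS (
  format = 'CSV',
  uris = ['s3://my-provider-bucket/transactions/*.csv']  -- hypothetical S3 bucket and path
);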
These enhancements empower organizations operating in multi-cloud environments or using diverse storage solutions, facilitating direct data analysis and reporting without the complexities of data migration.
From data loading to data cleansing and refinement: Dataform
Leveraging Dataform's capabilities, organizations can access data stored across multicloud environments and various file formats, seamlessly creating native tables or views directly in BigQuery. This integration simplifies the process of data cleansing, refinement, and preparation, enabling the creation of consolidated resources within BigQuery for streamlined analytics and reporting.
Dataform's workflow management features enhance efficiency by automating tasks such as data validation, transformation, and documentation, ensuring that data pipelines are robust and reliable.
By leveraging these tools, organizations can effectively utilize data from multicloud sources to derive actionable insights and drive informed decision-making.
A real example
Consider an external data provider that has been granted write access to a specific Google Cloud Storage (GCS) bucket, created specifically for their use. This provider will upload data as files, such as CSVs, to this bucket.
The data provider follows a specified file path and file naming convention for data delivery. On the provider's side, a script runs each day to extract transaction data from its system, maintaining a consistent structure. The script saves this data in a CSV file and uploads it to the GCS bucket using the correct naming format.
Currently, the data flows regularly into the GCS bucket and has not yet been imported into BigQuery (BQ). The goal is to make this data available in BQ in a well-managed manner, enabling its use for various purposes such as dashboarding and analytics.
The following sub-sections present the possible steps required to achieve a comprehensive data preparation process.
External Table - Data Exploration and reference
An external table in BigQuery allows you to query data stored outside of BigQuery, such as in Google Cloud Storage (GCS), without loading the data into BigQuery. To maximize the efficiency and utility of external tables, certain preconditions and best practices should be observed.
Within this specific use case, CSV files are loaded into GCS following some established conventions, and these conventions add significant value to the follow-up data preparation framework.
Key characteristics that facilitate data loading into BigQuery (BQ) from GCS include a consistent file path and naming convention, a stable file format and delimiter (CSV in this case), and an unchanging column structure across deliveries.
To explore the data, an external table simply needs to be created in BQ. This can be done in different ways - here the manual approach is presented:
In the above image, you can see the configuration of the external table in BigQuery when creating it manually via the UI (the same can be achieved using SQL code). The important points to note are the table type (External table), the source URI pointing to the GCS bucket path (a wildcard such as * can be used to match all delivered files), the CSV file format, and whether schema auto-detection is enabled.
Additionally, there are more options available under Advanced settings, such as changing the standard field separator, allowing for field encapsulation, handling incomplete rows, error acceptance, and other settings. These can help manage various deviations from the ideal file configuration.
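As a hedged sketch of the equivalent SQL (project, dataset, and bucket path are hypothetical placeholders), the same external table could be declared with a CREATE EXTERNAL TABLE statement, where the advanced settings map to entries in the OPTIONS list:

CREATE OR REPLACE EXTERNAL TABLE `my_project.raw_dataset.0001__customer_transactions__external_table_test`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-provider-bucket/transactions/*.csv'],  -- wildcard matches all delivered files
  skip_leading_rows = 1,         -- skip the CSV header row
  field_delimiter = ',',         -- the standard field separator
  quote = '"',                   -- field encapsulation character
  allow_jagged_rows = FALSE,     -- whether rows with missing trailing columns are accepted
  max_bad_records = 0            -- how many bad records are tolerated before a query fails
);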
Below you can see the result in BigQuery.
In the above image, you can see that the table has been created in BigQuery. It does not consume storage, and the pointer to the GCS filepath is listed.
Additionally, the schema has been generated based on a subset of the records processed. Different field types are auto-detected and imposed in the schema by BigQuery itself. As a note, the schema could have been provided as input during the creation process instead of auto-detecting it.
Sometimes, it is advisable to define the schema manually to avoid errors. The external table is just a reference, and when querying, if any real data within the files deviates from the defined schema, the query will fail.
For instance, consider a field detected as an INT, such as sales quantity. If, due to a bad manual input, one out of a million records in the CSV files contains a decimal value like 1.1123454 instead of an integer, the query will fail.
To handle such cases, one option is to declare the field as STRING and, in the subsequent data preparation step, use a SAFE_CAST to transform it into an INT64. This approach allows for managing any bad data present in the file according to the company's data governance policies.
Below is an example on the external table with sales quantity set as STRING, since it was not safe to declare it as INT even though that would be the auto-detected type (99.999...% of the values would fit it correctly).
Based on the confidence in the field formats present in the CSV files, the schema needs to be declared appropriately to avoid possible errors. When there is uncertainty, it is advisable to use STRING and apply a post-processing conversion. This method ensures that unexpected data formats, such as a numeric field containing a decimal value or wrongly formatted dates, do not cause queries on the table to fail. Conversely, if the system providing the data has robust data validation, more specific field types can be confidently used.
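As a minimal sketch of this pattern (reusing the Sales_Qty field and the external table name that appear in the Dataform snippet later on; project and dataset names are placeholders), a query along these lines surfaces the records that violate the expected format without failing:

SELECT
  Sales_Qty,                                       -- raw STRING value as delivered in the CSV
  SAFE_CAST(Sales_Qty AS INT64) AS sales_qty_int   -- NULL instead of an error for bad values
FROM `my_project.raw_dataset.0001__customer_transactions__external_table_test`
WHERE Sales_Qty IS NOT NULL
  AND SAFE_CAST(Sales_Qty AS INT64) IS NULL;       -- isolate records that do not parse as integers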
At the end of this process, the external table is ready. It will always have the most up-to-date data, as it reads live from the CSV files, and can be used as a starting point for subsequent data preparation operations.
Requirement definition for data preparation
The next step involves defining the target schema, which should result from processing the data to prepare it for future use. This schema should include the final field names, the target data types, a description for each field, and any new or derived fields to be generated.
Additional information needed includes table-level metadata such as the table description, the partitioning strategy, tags, and labels (e.g. responsible team, usage, source system).
Other meaningful requirements may be needed based on different scenarios.
This analysis and requirements gathering is essential to create the specifications for the developer who will code this step, starting from the created external table.
By establishing these requirements, the developer will have a clear understanding of how to transform and prepare the data for subsequent usage, ensuring consistency, accuracy, and usability in future analyses.
Data Preparation with Cleansing/Enhancing procedure
The requirements for transforming and preparing data can be efficiently managed using a Dataform SQLX file. In Dataform, the external table serves as the starting point for these operations. The SQLX file allows you to define various tasks such as field renaming, adding descriptions, and generating new fields.
These transformations are executed through different components of the SQLX structure. The config block can be used to set up table properties and configurations, while the SQL statement block is where you perform the actual data processing. For instance, to handle field renaming and type casting, you can write SQL statements within the block that align with your target schema requirements.
An important addition to any data processing step is tracking the source of each record. By leveraging the external table's _FILE_NAME pseudo-column, you can create a field such as data_source__full_gcs_path (i.e. _FILE_NAME AS data_source__full_gcs_path). This field helps identify the origin of the data, making it easier to spot anomalies and maintain a clear data lineage.
Dataform's flexibility supports both hardcoding the necessary steps and using reusable components. For example, you can define constant values and functions in constant.js, or create specific functions within the includes folder for casting and safe parsing (even better, adopt NPM packages reusable across repositories), as sketched below. By defining these functions once, you can ensure consistent data processing practices and reduce management overhead.
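As an illustrative sketch (file and function names are hypothetical), such helpers could live in the includes folder and be referenced from any SQLX file:

// includes/parsing.js - hypothetical helpers for consistent casting/parsing
function safeInt(column) {
  // Returns NULL instead of failing when the value does not fit INT64
  return `SAFE_CAST(${column} AS INT64)`;
}

function safeDate(column, format = '%Y-%m-%d') {
  // Safely parses a STRING date in the given format
  return `SAFE.PARSE_DATE('${format}', ${column})`;
}

module.exports = { safeInt, safeDate };

In a SQLX file these could then be referenced as ${parsing.safeInt('Sales_Qty')} AS Sales_Qty, keeping the casting logic in a single place.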
Overall, Dataform provides a robust framework to implement and manage the transformations required for preparing your data, allowing for efficient and scalable data processing workflows.
config {
  type: 'table',
  tags: ['...','...'],
  description: "....",
  bigquery: {
    partitionBy: '...',
    labels: {
      responsible: "...",
      usage: "...",
      source_system: "...",
    }
  },
  columns: {
    field: "description",
    field2: "description",
    field3: "description",
    field4: "description",
    field5: "description",
    data_source__full_gcs_path: "Full GCS path of the file where the record originates"
  }
}
/*
############ Description:
- ...
############ Author:
- ...
- Date Created: [....]
############ Version History: (Optional)
- Version 1.0: [24/05/2024]
- Initial query creation.
############ Usage:
- TBD
############ Labels (for usage tracking, debugging, cost monitoring, cost attribution)
[Start] Internal Label [Start]
Project | ...
Priority | ...
Team | ...
... | ...
[End] Internal Label [End]
*/
SELECT
...,
SAFE.PARSE_DATETIME('%Y-%m-%d', DATE) AS DATE,
...,
SAFE_CAST(Sales_Qty AS INT64) AS Sales_Qty,
...,
_FILE_NAME AS data_source__full_gcs_path
FROM ${ref('0001__customer_transactions__external_table_test')}
In this proposed snippet, the information is hardcoded and no reference is made to constant.js or to general functions defined in JavaScript, which might otherwise be used. It is a simple version that still provides insight. The snippet splits into three sections: the config block (table type, tags, description, partitioning, labels, and column descriptions), a SQL comment header documenting description, author, version history, usage, and internal labels for cost tracking and attribution, and the SELECT statement that performs the actual parsing and casting and exposes _FILE_NAME as data_source__full_gcs_path.
Test and publishing
BigQuery's table cloning feature presents a powerful alternative for managing and publishing tables across different datasets. This capability allows development and data preparation to remain within a specific dataset (available only to data preparation developers) while the prepared table is cloned to another dataset, accessible to data analysts, BI users, or other stakeholders managing downstream operations on the cleaned/prepared data.
Advantages of Table Cloning
A clone is a lightweight, writable copy: it is created almost instantly and storage is billed only for data that diverges from the base table, which keeps publishing fast and cost-effective while cleanly separating development datasets from consumer-facing ones.
This can be achieved with a post_operations block in the .sqlx file where the data preparation tasks are performed, using a different table suffix or destination dataset depending on whether it is a test release or a prod release.
post_operations {
  ${when(dataform.projectConfig.vars.executionSetting === 'test',
    `CREATE OR REPLACE TABLE \`project.dataset.0001__customer_transactions__v_dataform_test\`
     CLONE ${self()};`,
    ''
  )}
  ${when(dataform.projectConfig.vars.executionSetting === 'prod',
    `CREATE OR REPLACE TABLE \`project.dataset.0001__customer_transactions\`
     CLONE ${self()};`,
    ''
  )}
}
A nicer way of doing this is to create a function that performs the clone on a specific subset of tables managed by Dataform. This can be achieved by declaring some constant information and a function that dynamically performs the cloning operation for each table listed in a purpose-built clones.js (use any name that works for you), as a dependency of the data preparation.
const TARGET_DATASETS = [
{ prefix: 'my_target_dataset___', target: 'my_target_dataset' },
{ prefix: '...', target: '...' },
{ prefix: '...', target: '...' }
]
const TABLES = [
{ name: "my_target_dataset___0001__customer_transactions", author: 'my_ID'},
{ name: "my_target_dataset___0001__stores_....", author: 'my_ID'},
]
module.exports = { TABLES, TARGET_DATASETS }
And then the related function:
// clones.js is assumed to live in the includes folder, so it is available globally as `clones`
const tableNames = clones.TABLES
tableNames.forEach(({ name, author }) => {
  // Match each table to its target dataset based on the prefix naming convention
  const dataset = clones.TARGET_DATASETS.find((targetDataset) => name.indexOf(targetDataset.prefix) === 0)
  if (dataset && (dataform.projectConfig.vars.executionSetting === 'prod' || dataform.projectConfig.vars.executionSetting === 'test')) {
    const projectId = 'my_Project'
    const dataset_name = dataset.target
    const table_name = name.replace(dataset.prefix, '')
    operate('Clone_Operation_' + name).queries(ctx => `
/*
############ Description:
- This script takes care of cloning tables based on their final name within the Dataform prod/test environment. It loops over them and, given certain criteria, clones these tables into the right dataset for usage (based on prefix)
############ Author:
- ${author} (Contact Information)
- Date Created: [...]
############ Labels (for usage tracking, debugging, cost monitoring, cost attribution)
[Start] Internal Label [Start]
Project | ...
Priority | High
Team | ...
User | ${author}
Functionality | Cloning
[End] Internal Label [End]
*/
CREATE OR REPLACE TABLE \`${projectId}.${dataset_name}.${table_name}${dataform.projectConfig.vars.executionSetting === 'test' ? '__v_dataform_test' : ''}\`
CLONE ${ctx.ref(name)};`);
  } else if (dataset && dataform.projectConfig.vars.executionSetting === 'dev') {
    // In dev, run a harmless placeholder so the dependency graph stays identical across environments
    const projectId = 'my_Project'
    const dataset_name = dataset.target
    const table_name = name.replace(dataset.prefix, '')
    operate('dummy_clone_' + name).queries(ctx => `
SELECT "this is a dummy operation for the dev env, replacing the clone operation done in test/prod. In prod the table is cloned to ${projectId}.${dataset_name}.${table_name}; in test it is cloned to ${projectId}.${dataset_name}.${table_name}__v_dataform_test" FROM ${ctx.ref(name)};
`);
  }
});
The result of this script is visible on the compiled graph in Dataform: a "Clone_Operation_..." action appears for every table listed in the clones.js file. The cloning code is written only once, and extending it to new tables only requires updating the list in clones.js.
Data preparation triggering
As a lightweight approach to automating the data processing workflow with Dataform when a new file is loaded into a specific bucket, you can leverage Google Cloud Functions in conjunction with the Dataform API. Here's how this can be set up:
The Cloud Function then needs the code to execute when the trigger fires - this is essentially an API call. Dataform workflows can be initiated via the API; please check the related documentation here: Dataform API | Google Cloud
One way to do so is to create a workflow configuration in Dataform for the data preparation to be executed, similar to the one presented in the image below:
The parameters that need to be defined to create a workflow configuration are the following: a configuration ID, the release configuration to compile against, the subset of actions or tags to execute, and optionally a cron schedule with time zone (not needed here, since execution is triggered on demand via the API).
Once these parameters have been defined, the API call to the related endpoint can be made using the coding language of your choice within the Cloud Function.
Endpoint: the workflow invocation create method of the Dataform API, i.e. a POST request to https://dataform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/repositories/REPO/workflowInvocations
Payload:
{"workflowConfig":"projects/PROJECT_ID/locations/REGION/repositories/REPO/workflowConfigs/configuration_id" }
By integrating Google Cloud Functions with Dataform via the API, you can create an automated data processing pipeline that responds in real-time to new files added to a GCS bucket. This approach enhances operational efficiency by automating the triggering of Dataform workflows and ensures timely processing of incoming data. For more details on setting up Cloud Functions and interacting with Dataform via API, refer to the Google Cloud Functions documentation and the Dataform API documentation.
Depending on the needs, and for a more structured approach, different options are available to orchestrate operations within GCP, such as Cloud Workflows or Cloud Composer (Airflow).
Conclusions
Combining the robust services and functionalities offered by Google Cloud Platform (GCP) creates a highly efficient and scalable data processing workflow. File delivery to Google Cloud Storage (GCS) keeps the hand-off simple for the data provider. By leveraging external tables in BigQuery, you can access this data without an initial load, maintaining a seamless and up-to-date data source. Dataform streamlines data preparation within BigQuery, providing a structured and manageable way to transform raw data into clean, standardized data ready to deliver valuable insights. Cloud Functions, triggered by new file uploads to GCS and making API calls to Dataform, automate the entire process, ensuring that data preparation workflows are executed promptly and accurately. Cloud Workflows or Cloud Composer can be the right candidates for overall orchestration if needed.
This methodology scales with your data volume and complexity with little effort, as GCP's serverless architecture handles the underlying infrastructure. Moreover, table cloning in BigQuery allows for efficient and cost-effective data publishing, enabling data analysts and BI users to access the latest data without incurring significant storage costs. By maintaining development operations in a dedicated dataset and cloning tables to production environments, organizations can ensure data integrity and manage costs effectively.
Overall, this integrated GCP solution offers a powerful combination of reliability, scalability, and cost-efficiency, making it an ideal choice for use cases spanning data ingestion and data management/refinement to enable analytics.
References and Further Reading