Setting Up dbt Core on GCP: A Step-by-Step Guide

Deploying dbt Core on Google Cloud Platform (GCP) allows you to centralize and scale your data transformation workflows without relying on local environments. This guide will walk you through the steps to set up dbt Core directly on GCP, ensuring a cloud-native deployment that integrates seamlessly with GCP services.


1. What is dbt Core?

dbt Core (Data Build Tool) is an open-source tool designed to help analytics engineers build, test, and deploy data transformations using modular SQL. By leveraging dbt Core, teams can apply software engineering principles such as version control, testing, and modularity to data workflows.

Key Features:

  • SQL-Centric: Define transformations as SQL models.
  • Testable: Validate data transformations with built-in testing capabilities.
  • Extensible: Use macros and hooks for advanced workflows.
  • Community-Driven: Backed by an active open-source community.


2. Benefits of Running dbt Core on GCP

Setting up dbt Core on GCP offers several advantages:

  • Cloud-Native Deployment: No need for local environments, reducing maintenance overhead.
  • Scalability: Take advantage of GCP’s compute and storage resources.
  • Integration: Seamlessly integrates with BigQuery, IAM, and other GCP services.
  • Collaboration: Centralized deployment allows multiple users to work on the same environment.


3. GCP Services Required to Set Up dbt Core

To deploy dbt Core on GCP, you will need the following services:

  • Google BigQuery: For executing transformations and storing the resulting data.
  • Cloud Storage: To store dbt project files and logs.
  • Compute (Compute Engine or Cloud Run): To host dbt Core.
  • IAM (Identity and Access Management): For secure access and role management.
  • Cloud SDK (gcloud CLI): For interacting with GCP services during setup.


4. Steps to Set Up dbt Core on GCP

Step 1: Prepare GCP Environment

  • Enable Required APIs: In the GCP Console, enable the following APIs (example gcloud commands for this whole step follow the list):
    1. BigQuery API
    2. Compute Engine API
    3. Cloud Storage API
  • Set Up IAM Roles: Create a service account with the following roles:
    1. BigQuery Data Editor: To create and modify tables during transformations.
    2. BigQuery Job User: To run query jobs (Data Editor alone does not allow submitting queries).
    3. Storage Object Viewer: To read project files from Cloud Storage.
    4. Storage Object Creator: To write logs or results.
  • Download Service Account Key: Save the JSON key file for the service account; it will be used for authentication in the next step.
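
If you prefer the command line over the console, the commands below sketch the same setup with the gcloud CLI. The service account name dbt-runner is an example; replace your-project-id with your own project:

# Enable the required APIs
gcloud services enable bigquery.googleapis.com compute.googleapis.com storage.googleapis.com

# Create a service account for dbt
gcloud iam service-accounts create dbt-runner --display-name="dbt Core runner"

# Grant a role (repeat for roles/bigquery.jobUser, roles/storage.objectViewer, roles/storage.objectCreator)
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:dbt-runner@your-project-id.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Download a JSON key for the service account
gcloud iam service-accounts keys create service-account-key.json \
  --iam-account=dbt-runner@your-project-id.iam.gserviceaccount.com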


Step 2: Deploy dbt Core Using Compute Engine

  • Create a Virtual Machine (VM): In the GCP Console, create a VM instance (an equivalent gcloud command is shown below):
    1. Machine Type: e2-medium or higher (depending on workload).
    2. OS: Ubuntu 20.04 LTS.
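
If you would rather script the VM creation, the following gcloud command creates an equivalent instance; the instance name dbt-vm and the zone are example values:

gcloud compute instances create dbt-vm \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud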

  • Install Required Dependencies on the VM: SSH into the VM instance and execute the following commands:

# Update and install required packages
sudo apt update && sudo apt install -y python3-pip python3-venv git

# Create a virtual environment
python3 -m venv dbt-env
source dbt-env/bin/activate

# Install dbt with BigQuery adapter
pip install dbt-bigquery        
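
A quick way to confirm the installation before continuing is to check the installed version:

dbt --version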

  • Transfer Service Account Key: Upload the service account key to the VM securely using the following command:

scp path/to/service-account-key.json username@your-vm-ip:/path/to/destination/        

  • Authenticate dbt with BigQuery: Set the environment variable for Google Application Credentials:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json        
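
Since environment variables set this way only last for the current SSH session, you may also want to restrict the key file's permissions and persist the variable (the path below is a placeholder):

# Limit access to the key file and keep the variable across sessions
chmod 600 /path/to/service-account-key.json
echo 'export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json' >> ~/.bashrc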

Step 3: Configure dbt Project

  • Initialize dbt Project: On the VM, run the following command to create a new dbt project:

dbt init my_dbt_project
cd my_dbt_project        

  • Set Up `profiles.yml`: Create and configure the profiles.yml file (by default dbt looks for it in ~/.dbt/). The top-level profile name must match the `profile` setting in your project's dbt_project.yml:

bigquery_project:
  outputs:
    prod:
      type: bigquery
      method: service-account
      project: your-project-id
      dataset: your-dataset
      keyfile: /path/to/service-account-key.json
      threads: 4
  target: prod        

  • Test Configuration: Run the following command to verify the connection:

dbt debug        
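
Once dbt debug reports a successful connection, you can add a model and build it. The model below is only a placeholder to illustrate the workflow; it assumes a raw_orders table already exists in your dataset, so adapt the SQL to your own data:

# Create a simple example model (placeholder SQL -- adjust table and columns to your data)
cat > models/stg_orders.sql <<'SQL'
select
    order_id,
    customer_id,
    order_date
from `your-project-id.your-dataset.raw_orders`
SQL

# Build all models in the project
dbt run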

Step 4: Automate dbt Core with Cloud Scheduler and Cloud Run

  • Containerize dbt Core: Create a Dockerfile to containerize dbt Core:

FROM python:3.9-slim
WORKDIR /app

# Install dbt and the BigQuery adapter
RUN pip install dbt-bigquery

# Copy dbt project files into the image
COPY . /app

# Note: dbt also needs connection details inside the container. Either include a
# profiles.yml in the copied project and point DBT_PROFILES_DIR at it, or mount
# the file at runtime; otherwise "dbt run" will not find a profile.

CMD ["dbt", "run"]

  • Deploy to Cloud Run:

Build and push the Docker image to Container Registry:

gcloud builds submit --tag gcr.io/your-project-id/dbt-core        

Deploy the image to Cloud Run:

gcloud run deploy dbt-core \
  --image gcr.io/your-project-id/dbt-core \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated        

  • Schedule Runs with Cloud Scheduler: Use Cloud Scheduler to trigger the Cloud Run service at specified intervals, for example with an HTTP job as sketched below.
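
A minimal sketch of such a schedule is shown here. The job name, cron expression, service URL, and service account are placeholders, and it assumes your Cloud Run service starts a dbt run when it receives an HTTP request. In practice you would usually drop --allow-unauthenticated from the deployment above and let Cloud Scheduler authenticate with an OIDC token instead:

# Run the job every day at 06:00 UTC (replace the URL with your Cloud Run service URL)
gcloud scheduler jobs create http dbt-daily-run \
  --location=us-central1 \
  --schedule="0 6 * * *" \
  --uri="https://your-cloud-run-service-url" \
  --http-method=POST \
  --oidc-service-account-email="dbt-runner@your-project-id.iam.gserviceaccount.com"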


Conclusion

Setting up dbt Core on GCP allows you to leverage the power of cloud-native tools for data transformation. By integrating with BigQuery and automating workflows using Cloud Run and Cloud Scheduler, you can create a scalable and efficient data transformation pipeline.
