Streamlining Data Transformation with dbt-core and PostgreSQL on Ubuntu

In the ever-evolving landscape of data analytics, the ability to efficiently transform and manage data is paramount. Enter dbt-core, an open-source command-line tool that empowers data analysts and engineers to transform data in their warehouses using SQL. It’s a game-changer for creating scalable, maintainable, and testable data pipelines.

Setting Up a dbt Project with PostgreSQL on Ubuntu 22.04

Step 1: Install PostgreSQL

sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
        

PostgreSQL, the robust and open-source relational database, is our starting point. Installation on Ubuntu is straightforward with the apt package manager.
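Before moving on, it is worth confirming that the server is actually running. A quick sanity check, assuming the default systemd service name:

sudo systemctl status postgresql
psql --version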

Step 2: Create a PostgreSQL Database and User

CREATE DATABASE my_database;
CREATE USER my_user WITH ENCRYPTED PASSWORD 'my_password';
GRANT ALL PRIVILEGES ON DATABASE my_database TO my_user;
        

Next, we establish a new database and user, setting the stage for secure data management.
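These statements must be issued by a PostgreSQL superuser. The usual way to open a psql prompt as the default postgres superuser on Ubuntu:

sudo -u postgres psql

One caveat: Ubuntu 22.04 ships PostgreSQL 14, where the grant above is sufficient, but on PostgreSQL 15 and newer the public schema is no longer writable by ordinary users, so you would also need to run GRANT ALL ON SCHEMA public TO my_user; while connected to my_database.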

Step 3: Create Sample Data

Create a CSV file named data.csv with the following content:

id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35
        

We then create a simple CSV file to simulate data, which will later be imported into our PostgreSQL database.
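If you prefer to stay in the terminal, a heredoc creates the same file in one step:

cat > data.csv <<'EOF'
id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35
EOF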

Step 4: Create a Table in PostgreSQL and Import Data

psql -h localhost -U my_user -d my_database
        

Then, at the PostgreSQL prompt:

CREATE TABLE public.my_table (
    id INTEGER,
    name TEXT,
    age INTEGER
);
\copy public.my_table FROM 'data.csv' CSV HEADER
\q
        

With our data ready, we create a table in PostgreSQL and import our CSV data, bringing our sample data to life within the database.
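To verify the import, run this at the psql prompt; with the sample file above, it should report 3:

SELECT COUNT(*) FROM public.my_table;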

Step 5: Install dbt-core and the dbt-postgres Adapter

pip install dbt-core
pip install dbt-postgres        

Installing dbt-core and the dbt-postgres adapter is as simple as running two pip commands, ideally inside a virtual environment, which sets us up for the next steps in our data transformation journey.
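Because dbt-postgres declares dbt-core as a dependency, the second command above already pulls in the first. A self-contained setup inside a virtual environment (the environment name .venv is arbitrary) might look like this:

python3 -m venv .venv
source .venv/bin/activate
pip install dbt-postgres
dbt --version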

Step 6: Create a New dbt Project

dbt init mytest
        

Initializing a new dbt project lays the groundwork for our transformation tasks.
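dbt init scaffolds a directory structure roughly like the following; the exact contents vary between dbt versions:

mytest/
├── dbt_project.yml
├── models/
│   └── example/
├── analyses/
├── macros/
├── seeds/
├── snapshots/
└── tests/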

Step 7: Configure Your dbt Profile

Edit the profiles.yml file located in the ~/.dbt/ directory:

mytest:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: my_user
      pass: my_password
      port: 5432
      dbname: my_database
      schema: public
      threads: 1
        

Configuring the dbt profile connects our dbt project with the PostgreSQL database we set up earlier.
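Before touching any models, dbt can validate both the profile syntax and the database connection:

cd mytest
dbt debug

If everything is wired up correctly, dbt debug reports the connection test as OK.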

Step 8: Create a dbt Model

Create a new model in the models directory of your dbt project (models/my_first_dbt_model.sql). Note that dbt init scaffolds example models under models/example/, one of which is also named my_first_dbt_model.sql, so delete the example directory first to avoid a duplicate model name error:

-- materialize as a table (the default is a view) so Step 10 can query it directly
{{ config(materialized='table') }}

SELECT
    id,
    name,
    age,
    CASE WHEN age < 30 THEN 'young' ELSE 'old' END AS age_group
FROM public.my_table
        

Creating a dbt model is where the magic happens. This SQL file defines the transformation logic that dbt will apply to our data; the config block at the top materializes the result as a table rather than the default view.
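As an optional refinement, dbt convention is to reference raw tables through sources rather than hard-coded table names, which makes lineage visible in the generated docs. A minimal sketch, with the source name my_source chosen purely for illustration (models/sources.yml):

version: 2

sources:
  - name: my_source
    schema: public
    tables:
      - name: my_table

The model’s last line would then read FROM {{ source('my_source', 'my_table') }} instead of FROM public.my_table.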

Step 9: Run the dbt Pipeline

dbt run
        

Running the dbt pipeline executes the transformations we’ve defined, materializing the results back in our database.
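On larger projects you rarely rebuild everything at once; the --select flag narrows a run to specific models:

dbt run --select my_first_dbt_model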

Step 10: Query the Resulting Table

psql -h localhost -U my_user -d my_database
        

Then, at the PostgreSQL prompt:

SELECT * FROM public.my_first_dbt_model;
\q
        

After running our dbt pipeline, we can query the resulting table to see the transformed data.
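Given the sample rows and the CASE expression from Step 8, the output should look like this:

 id |  name   | age | age_group
----+---------+-----+-----------
  1 | Alice   |  25 | young
  2 | Bob     |  30 | old
  3 | Charlie |  35 | old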

Step 11: Create a dbt Test

Generic dbt tests are defined in .yml files that live alongside your models in the models directory (the project’s tests directory is reserved for singular, SQL-based tests). Create a new file for your model (models/schema.yml):

version: 2

models:
  - name: my_first_dbt_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: age
        tests:
          - not_null
        

These tests check that the id column is unique and not null, and that the age column is not null.
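Under the hood, each generic test compiles to a query that returns the failing rows, and a test passes when that query returns nothing. The not_null test on id, for example, compiles to SQL roughly equivalent to:

SELECT id
FROM public.my_first_dbt_model
WHERE id IS NULL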

Step 12: Run the dbt Test

dbt test
        

You can run your dbt tests using the dbt test command. This will execute all your tests and validate your data.
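As with dbt run, tests can be scoped with --select; and if you want to run and test models together in dependency order, dbt build combines both:

dbt test --select my_first_dbt_model
dbt build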

Step 13: Document Your dbt Project

dbt allows you to add descriptions to your models and columns in the schema.yml files; these descriptions are then used to generate the documentation. Extending the models/schema.yml file from Step 11 might look like this:

version: 2

models:
  - name: my_first_dbt_model
    description: This is my first dbt model.
    columns:
      - name: id
        description: This is the unique identifier for each row.
        tests:
          - unique
          - not_null
      - name: name
        description: This is the name of the person.
      - name: age
        description: This is the age of the person.
        tests:
          - not_null
        

Step 14: Generate the dbt Documentation

dbt docs generate
        

Generating documentation with dbt creates a static website that beautifully presents our project’s details.

Step 15: View the dbt Documentation

dbt docs serve
        

Finally, we serve the generated documentation locally, allowing us to review our project’s comprehensive documentation.
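By default the site is served on localhost port 8080; pass a different port if that one is taken:

dbt docs serve --port 8001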

Conclusion

The combination of dbt-core and PostgreSQL on Ubuntu provides a powerful and flexible environment for data transformation. By following the steps outlined above, you can set up a robust dbt project that streamlines your data workflows, ensuring that your data is not only reliable but also easily understandable and maintainable.

Tech Stack

Linux (Ubuntu 22.04)

Python (3.10.1)

Python venv or pyenv (for managing Python versions and installing packages)

dbt-core and dbt-postgres

PostgreSQL instance (local)

SQL

Visual Studio Code / Neovim (code editor)

WSL (if you are on Windows, to run Linux)
