Streamlining Data Transformation with dbt-core and PostgreSQL on Ubuntu

In the ever-evolving landscape of data analytics, the ability to efficiently transform and manage data is paramount. Enter dbt-core, an open-source command-line tool that empowers data analysts and engineers to transform data in their warehouses using SQL. It’s a game-changer for creating scalable, maintainable, and testable data pipelines.

Setting Up a dbt Project with PostgreSQL on Ubuntu 22.04

Step 1: Install PostgreSQL

sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
        

PostgreSQL, the robust and open-source relational database, is our starting point. Installation on Ubuntu is straightforward with the apt package manager.
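Before moving on, it is worth confirming that the server is actually running. A quick sanity check, assuming the default systemd service name:

sudo systemctl status postgresql
psql --version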

Step 2: Create a PostgreSQL Database and User

CREATE DATABASE my_database;
CREATE USER my_user WITH ENCRYPTED PASSWORD 'my_password';
GRANT ALL PRIVILEGES ON DATABASE my_database TO my_user;
        

Next, we establish a new database and user, setting the stage for secure data management.
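These statements must be issued by a PostgreSQL superuser. The usual way to open a psql prompt as the default postgres superuser on Ubuntu:

sudo -u postgres psql

One caveat: Ubuntu 22.04 ships PostgreSQL 14, where the grant above is sufficient, but on PostgreSQL 15 and newer the public schema is no longer writable by ordinary users, so you would also need to run GRANT ALL ON SCHEMA public TO my_user; while connected to my_database.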

Step 3: Create Sample Data

Create a CSV file named data.csv with the following content:

id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35
        

We then create a simple CSV file to simulate data, which will later be imported into our PostgreSQL database.
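If you prefer to stay in the terminal, a heredoc creates the same file in one step:

cat > data.csv <<'EOF'
id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35
EOF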

Step 4: Create a Table in PostgreSQL and Import Data

psql -h localhost -U my_user -d my_database
        

Then, at the PostgreSQL prompt:

CREATE TABLE public.my_table (
    id INTEGER,
    name TEXT,
    age INTEGER
);
\copy public.my_table FROM 'data.csv' CSV HEADER
\q
        

With our data ready, we create a table in PostgreSQL and import our CSV data, bringing our sample data to life within the database.
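To verify the import, run this at the psql prompt; with the sample file above, it should report 3:

SELECT COUNT(*) FROM public.my_table;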

Step 5: Install dbt-core and the dbt-postgres Adapter

pip install dbt-core
pip install dbt-postgres        

Installing dbt-core and the dbt-postgres adapter is as simple as running two pip commands, ideally inside a virtual environment, which sets us up for the next steps in our data transformation journey.
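Because dbt-postgres declares dbt-core as a dependency, the second command above already pulls in the first. A self-contained setup inside a virtual environment (the environment name .venv is arbitrary) might look like this:

python3 -m venv .venv
source .venv/bin/activate
pip install dbt-postgres
dbt --version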

Step 6: Create a New dbt Project

dbt init mytest
        

Initializing a new dbt project lays the groundwork for our transformation tasks.
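dbt init scaffolds a directory structure roughly like the following; the exact contents vary between dbt versions:

mytest/
├── dbt_project.yml
├── models/
│   └── example/
├── analyses/
├── macros/
├── seeds/
├── snapshots/
└── tests/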

Step 7: Configure Your dbt Profile

Edit the profiles.yml file located in the ~/.dbt/ directory:

mytest:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: my_user
      pass: my_password
      port: 5432
      dbname: my_database
      schema: public
      threads: 1
        

Configuring the dbt profile connects our dbt project with the PostgreSQL database we set up earlier.
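Before touching any models, dbt can validate both the profile syntax and the database connection:

cd mytest
dbt debug

If everything is wired up correctly, dbt debug reports the connection test as OK.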

Step 8: Create a dbt Model

Create a new model in the models directory of your dbt project (models/my_first_dbt_model.sql). Note that dbt init scaffolds example models under models/example/, one of which is also named my_first_dbt_model.sql, so delete the example directory first to avoid a duplicate model name error:

-- materialize as a table (the default is a view) so Step 10 can query it directly
{{ config(materialized='table') }}

SELECT
    id,
    name,
    age,
    CASE WHEN age < 30 THEN 'young' ELSE 'old' END AS age_group
FROM public.my_table
        

Creating a dbt model is where the magic happens. This SQL file defines the transformation logic that dbt will apply to our data; the config block at the top materializes the result as a table rather than the default view.
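As an optional refinement, dbt convention is to reference raw tables through sources rather than hard-coded table names, which makes lineage visible in the generated docs. A minimal sketch, with the source name my_source chosen purely for illustration (models/sources.yml):

version: 2

sources:
  - name: my_source
    schema: public
    tables:
      - name: my_table

The model’s last line would then read FROM {{ source('my_source', 'my_table') }} instead of FROM public.my_table.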

Step 9: Run the dbt Pipeline

dbt run
        

Running the dbt pipeline executes the transformations we’ve defined, materializing the results back in our database.
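On larger projects you rarely rebuild everything at once; the --select flag narrows a run to specific models:

dbt run --select my_first_dbt_model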

Step 10: Query the Resulting Table

psql -h localhost -U my_user -d my_database
        

Then, at the PostgreSQL prompt:

SELECT * FROM public.my_first_dbt_model;
\q
        

After running our dbt pipeline, we can query the resulting table to see the transformed data.
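Given the sample rows and the CASE expression from Step 8, the output should look like this:

 id |  name   | age | age_group
----+---------+-----+-----------
  1 | Alice   |  25 | young
  2 | Bob     |  30 | old
  3 | Charlie |  35 | old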

Step 11: Create a dbt Test

Generic dbt tests are defined in .yml files that live alongside your models in the models directory (the project’s tests directory is reserved for singular, SQL-based tests). Create a new file for your model (models/schema.yml):

version: 2

models:
  - name: my_first_dbt_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: age
        tests:
          - not_null
        

These tests check that the id column is unique and not null, and that the age column is not null.
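Under the hood, each generic test compiles to a query that returns the failing rows, and a test passes when that query returns nothing. The not_null test on id, for example, compiles to SQL roughly equivalent to:

SELECT id
FROM public.my_first_dbt_model
WHERE id IS NULL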

Step 12: Run the dbt Test

dbt test
        

You can run your dbt tests using the dbt test command. This will execute all your tests and validate your data.
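As with dbt run, tests can be scoped with --select; and if you want to run and test models together in dependency order, dbt build combines both:

dbt test --select my_first_dbt_model
dbt build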

Step 13: Document Your dbt Project

dbt allows you to add descriptions to your models and columns in the schema.yml files; these descriptions are then used to generate the documentation. Extending the models/schema.yml file from Step 11 might look like this:

version: 2

models:
  - name: my_first_dbt_model
    description: This is my first dbt model.
    columns:
      - name: id
        description: This is the unique identifier for each row.
        tests:
          - unique
          - not_null
      - name: name
        description: This is the name of the person.
      - name: age
        description: This is the age of the person.
        tests:
          - not_null
        

Step 14: Generate the dbt Documentation

dbt docs generate
        

Generating documentation with dbt creates a static website that beautifully presents our project’s details.

Step 15: View the dbt Documentation

dbt docs serve
        

Finally, we serve the generated documentation locally, allowing us to review our project’s comprehensive documentation.
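By default the site is served on localhost port 8080; pass a different port if that one is taken:

dbt docs serve --port 8001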

Conclusion

The combination of dbt-core and PostgreSQL on Ubuntu provides a powerful and flexible environment for data transformation. By following the steps outlined above, you can set up a robust dbt project that streamlines your data workflows, ensuring that your data is not only reliable but also easily understandable and maintainable.

Tech Stack

Linux (Ubuntu 22.04)

Python (3.10.1)

Python venv or pyenv (for managing Python versions and installing packages)

dbt-core and dbt-postgres

PostgreSQL instance (local)

SQL

Visual Studio Code / Neovim (code editor)

WSL (if you are on Windows, to run Linux)
