Streamlining Data Transformation with dbt-core and PostgreSQL on Ubuntu
Sulfi Bashy
Azure Data Engineer & Architect | Agile Data Leader with Proven Stakeholder Engagement | Passionate Problem Solver & Continuous Learner | Sharing Knowledge through Technical Writing
In the ever-evolving landscape of data analytics, the ability to efficiently transform and manage data is paramount. Enter dbt-core, an open-source command-line tool that empowers data analysts and engineers to transform data in their warehouses using SQL. It’s a game-changer for creating scalable, maintainable, and testable data pipelines, often referred to as dbt pipelines.
Setting Up a dbt Project with PostgreSQL on Ubuntu 22.04
Step 1: Install PostgreSQL
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
PostgreSQL, the robust and open-source relational database, is our starting point. Installation on Ubuntu is straightforward with the apt package manager.
Step 2: Create a PostgreSQL Database and User
Connect to PostgreSQL as the postgres superuser (sudo -u postgres psql), then run:
CREATE DATABASE my_database;
CREATE USER my_user WITH ENCRYPTED PASSWORD 'my_password';
GRANT ALL PRIVILEGES ON DATABASE my_database TO my_user;
Next, we establish a new database and user, setting the stage for secure data management.
Step 3: Create Sample Data
Create a CSV file named data.csv with the following content:
id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35
We then create a simple CSV file to simulate data, which will later be imported into our PostgreSQL database.
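If you prefer to generate the file programmatically, here is a minimal Python sketch using only the standard library csv module; the file name and rows match the sample above:

```python
import csv

# Rows matching the sample data above
rows = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()   # writes the id,name,age header row
    writer.writerows(rows)
```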
Step 4: Create a Table in PostgreSQL and Import Data
psql -h localhost -U my_user -d my_database
Then, at the PostgreSQL prompt:
CREATE TABLE public.my_table (
id INTEGER,
name TEXT,
age INTEGER
);
\copy public.my_table FROM 'data.csv' CSV HEADER
\q
With our data ready, we create a table in PostgreSQL and import our CSV data, bringing our sample data to life within the database.
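Before importing a larger file, it can help to sanity-check that the CSV matches the table's column types, since \copy aborts on the first bad row. A stdlib-only Python sketch (the validate_row helper is our own, not part of any library; it reads the sample data inline for illustration):

```python
import csv
import io

# Inline copy of the sample CSV; in practice you would open data.csv instead.
SAMPLE = "id,name,age\n1,Alice,25\n2,Bob,30\n3,Charlie,35\n"

def validate_row(row):
    # Check one row against the my_table schema: id INTEGER, name TEXT, age INTEGER.
    int(row["id"])    # raises ValueError on a non-integer id
    int(row["age"])   # raises ValueError on a non-integer age
    if not row["name"]:
        raise ValueError("name must be non-empty")

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    validate_row(row)
```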
Step 5: Install dbt-core and dbt-postgres connector
pip install dbt-core
pip install dbt-postgres
Installing dbt-core and the Postgres adapter is as simple as running two pip commands (dbt-postgres pulls in dbt-core as a dependency), which sets us up for the next steps in our data transformation journey.
Step 6: Create a New dbt Project
dbt init mytest
Initializing a new dbt project lays the groundwork for our transformation tasks.
Step 7: Configure Your dbt Profile
Edit the profiles.yml file located in the ~/.dbt/ directory:
mytest:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: my_user
      pass: my_password
      port: 5432
      dbname: my_database
      schema: public
Configuring the dbt profile connects our dbt project with the PostgreSQL database we set up earlier.
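Hard-coding the password in profiles.yml is fine for a local experiment, but dbt can also read values from environment variables via the env_var Jinja function. A variant of the dev output above (the DBT_PASSWORD variable name is an arbitrary choice of ours):

```yaml
mytest:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: my_user
      pass: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: my_database
      schema: public
```

With this in place, export DBT_PASSWORD=my_password in your shell before invoking dbt.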
Step 8: Create a dbt Model
Create a new model named my_first_dbt_model in the models directory of your dbt project (models/my_first_dbt_model.sql), so the later steps can reference it by that name. Note that recent versions of dbt init scaffold an example model with the same name under models/example/; delete that example folder to avoid a name collision:
SELECT
id,
name,
age,
CASE WHEN age < 30 THEN 'young' ELSE 'old' END AS age_group
FROM public.my_table
Creating a dbt model is where the magic happens. This SQL file defines the transformation logic that dbt will apply to our data.
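The CASE expression is the heart of this model. For intuition, the same bucketing logic can be sketched in plain Python (the age_group function name is ours, purely illustrative):

```python
def age_group(age):
    # Mirrors: CASE WHEN age < 30 THEN 'young' ELSE 'old' END
    return "young" if age < 30 else "old"

# The three sample rows from data.csv
rows = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
transformed = [(i, name, age, age_group(age)) for i, name, age in rows]
```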
Step 9: Run the dbt Pipeline
dbt run
Running the dbt pipeline executes the transformations we’ve defined, materializing the results back in our database.
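By default dbt materializes models as views; if you want the result persisted as a physical table instead, you can set the materialization in dbt_project.yml (shown here for our mytest project; view remains the default when omitted):

```yaml
# dbt_project.yml (excerpt)
models:
  mytest:
    +materialized: table
```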
Step 10: Query the Resulting Table
psql -h localhost -U my_user -d my_database
Then, at the PostgreSQL prompt:
SELECT * FROM public.my_first_dbt_model;
\q
After running our dbt pipeline, we can query the resulting table to see the transformed data.
Step 11: Create a dbt Test
Generic dbt tests are defined in .yml files that live alongside your models in the models directory (the tests directory is reserved for singular tests written as SQL SELECT statements). Create a new test for your model (models/schema.yml):
version: 2
models:
  - name: my_first_dbt_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: age
        tests:
          - not_null
This test checks that the id column is unique and not null, and that the age column is not null.
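Conceptually, unique and not_null are just queries that return offending rows: an empty result means the test passes. A stdlib-only Python sketch of the same checks (the function names are ours, for illustration only):

```python
from collections import Counter

def not_null_failures(rows, column):
    # Rows where the column is NULL (None) -- mirrors dbt's not_null test.
    return [r for r in rows if r[column] is None]

def unique_failures(rows, column):
    # Values that appear more than once -- mirrors dbt's unique test.
    counts = Counter(r[column] for r in rows)
    return [v for v, n in counts.items() if n > 1]

rows = [{"id": 1, "age": 25}, {"id": 2, "age": 30}, {"id": 3, "age": 35}]
```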
Step 12: Run the dbt Test
dbt test
You can run your dbt tests using the dbt test command. This will execute all your tests and validate your data.
Step 13: Document Your dbt Project
dbt allows you to add descriptions to your models and columns in the schema.yml files. These descriptions are then used to generate the documentation. Here’s an example of what this might look like:
version: 2
models:
  - name: my_first_dbt_model
    description: This is my first dbt model.
    columns:
      - name: id
        description: This is the unique identifier for each row.
        tests:
          - unique
          - not_null
      - name: name
        description: This is the name of the person.
      - name: age
        description: This is the age of the person.
        tests:
          - not_null
Step 14: Generate the dbt documentation
dbt docs generate
Generating documentation with dbt creates a static website that beautifully presents our project’s details.
Step 15: View the dbt documentation
dbt docs serve
Finally, we serve the generated documentation locally, allowing us to review our project’s comprehensive documentation.
Conclusion
The combination of dbt-core and PostgreSQL on Ubuntu provides a powerful and flexible environment for data transformation. By following the steps outlined above, you can set up a robust dbt project that streamlines your data workflows, ensuring that your data is not only reliable but also easily understandable and maintainable.
Tech Stack
Linux (Ubuntu 22.04)
Python (3.10.1)
Python venv or pyenv (for managing Python versions and installing Python packages)
dbt-core and dbt-postgres
PostgreSQL instance (local)
SQL
Visual Studio Code / nvim (code editor)
WSL (if you are on Windows, for installing Linux)