
Multi-database Joins with Trino and dbt

Sulfikkar Shylaja

Senior Data Engineer | Data Architect & Lead | Transforming Complex Data into Impactful Insights

Published: May 24, 2024

This guide provides step-by-step instructions on how to use Trino and dbt to perform joins across multiple databases and create models in a third database.

Overview

Trino is a highly performant distributed SQL query engine, designed to query large data sets distributed over one or more heterogeneous data sources. With Trino, you can query data where it lives, including Hive, Cassandra, relational databases, and even proprietary data stores.

dbt (data build tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

In this guide, we will use Trino and dbt to join tables from two PostgreSQL databases (db1 and db2), and create a model in a third PostgreSQL database (my_database). The model will be a table that joins the orders table from db1 and the customers table from db2.

Prerequisites

Ensure that you have the following installed on your machine:

Python 3.10.12
A PostgreSQL database instance
A working Linux environment

Here are the SQL scripts to create the databases, user, tables, and grant privileges. Please replace my_password with your actual password.

Create User:

CREATE USER my_user WITH PASSWORD 'my_password';

Create Databases:

CREATE DATABASE db1;
CREATE DATABASE db2;
CREATE DATABASE my_database;

Grant Privileges to User:

GRANT ALL PRIVILEGES ON DATABASE db1 TO my_user;
GRANT ALL PRIVILEGES ON DATABASE db2 TO my_user;
GRANT ALL PRIVILEGES ON DATABASE my_database TO my_user;

Create Tables:

Connect to db1 and create the orders table:

\c db1

CREATE TABLE public.orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product TEXT,
    quantity INT
);

INSERT INTO public.orders (order_id, customer_id, product, quantity) VALUES
(1, 101, 'Apples', 5),
(2, 102, 'Oranges', 10),
(3, 103, 'Pears', 8),
(4, 104, 'Grapes', 15),
(5, 105, 'Bananas', 12);

Connect to db2 and create the customers table:

\c db2

CREATE TABLE public.customers (
    customer_id INT PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    email TEXT
);

INSERT INTO public.customers (customer_id, first_name, last_name, email) VALUES
(101, 'John', 'Doe', '[email protected]'),
(102, 'Jane', 'Doe', '[email protected]'),
(103, 'Jim', 'Brown', '[email protected]'),
(104, 'Jake', 'Smith', '[email protected]'),
(105, 'Jill', 'Johnson', '[email protected]');

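Please note that these scripts should be run by a PostgreSQL superuser or a user with the necessary privileges. Database-level privileges do not cover the tables themselves: if you create the tables as a superuser, also grant my_user access to them (for example, GRANT SELECT ON public.orders TO my_user; in db1 and GRANT SELECT ON public.customers TO my_user; in db2), or create the tables while connected as my_user. On PostgreSQL 15 and later, my_user also needs the CREATE privilege on the public schema of my_database so that dbt can build the model there.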
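To see what the finished model should contain before any Trino setup, the join can be previewed locally. The following is only a sketch: it uses Python's built-in sqlite3 module with two attached in-memory databases as a stand-in for db1 and db2, not PostgreSQL or Trino. The function name preview_join is hypothetical, and the email column is left out for brevity.

```python
import sqlite3

# Sample rows copied from the INSERT statements above (email column omitted).
ORDERS = [
    (1, 101, "Apples", 5),
    (2, 102, "Oranges", 10),
    (3, 103, "Pears", 8),
    (4, 104, "Grapes", 15),
    (5, 105, "Bananas", 12),
]
CUSTOMERS = [
    (101, "John", "Doe"),
    (102, "Jane", "Doe"),
    (103, "Jim", "Brown"),
    (104, "Jake", "Smith"),
    (105, "Jill", "Johnson"),
]


def preview_join():
    """Join orders and customers across two attached in-memory databases."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates a separate in-memory database, standing in for db1 and db2.
    conn.execute("ATTACH DATABASE ':memory:' AS db1")
    conn.execute("ATTACH DATABASE ':memory:' AS db2")
    conn.execute(
        "CREATE TABLE db1.orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT, quantity INTEGER)"
    )
    conn.execute(
        "CREATE TABLE db2.customers ("
        "customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.executemany("INSERT INTO db1.orders VALUES (?, ?, ?, ?)", ORDERS)
    conn.executemany("INSERT INTO db2.customers VALUES (?, ?, ?)", CUSTOMERS)
    rows = conn.execute(
        """
        SELECT o.order_id, o.product, o.quantity, c.first_name, c.last_name
        FROM db1.orders AS o
        JOIN db2.customers AS c ON o.customer_id = c.customer_id
        ORDER BY o.order_id
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling preview_join() returns one row per order, each paired with the matching customer's name. The table built through Trino should hold the same five pairings.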

Install Java

Trino requires Java 21. If it’s not already installed, you can install it using the following commands:

sudo apt update
sudo apt install openjdk-21-jdk

Verify the installation by running java -version. The output should indicate that you’re running Java 21.

Install Trino

Download Trino 435 from the link https://repo1.maven.org/maven2/io/trino/trino-server/435/.
Extract the tarball using the command tar -xvf trino-server-435.tar.gz.
Navigate to the extracted directory with cd trino-server-435.

Configure Trino

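Create an etc directory inside the installation directory if it does not already exist (the tarball does not include one). Then open the etc/config.properties file and ensure it has the following content: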

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

Open the etc/jvm.config file and ensure it has the following content:

-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000

Open the etc/node.properties file and ensure it has the following content, setting node.data-dir to a writable directory on your own machine:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/sush03e1/trino-server-435/var

Create one properties file per PostgreSQL database (db1.properties, db2.properties, and my_database.properties) in the etc/catalog directory. The file name, minus the extension, becomes the catalog name in Trino. Each file has the following form; replace localhost, my_user, and my_password with your actual PostgreSQL host, user, and password, and change the database name at the end of connection-url to match the file:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/my_database
connection-user=my_user
connection-password=my_password

Create a db1.properties file: In the etc/catalog directory of your Trino installation, create a new file named db1.properties.

Configure the connector for db1: In the db1.properties file, add the necessary properties to configure the connector for your db1 database. For example:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/db1
connection-user=my_user
connection-password=my_password

Create a db2.properties file: In the etc/catalog directory of your Trino installation, create a new file named db2.properties.

Configure the connector for db2: In the db2.properties file, add the necessary properties to configure the connector for your db2 database. For example:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/db2
connection-user=my_user
connection-password=my_password

Create a my_database.properties file: In the etc/catalog directory of your Trino installation, create a new file named my_database.properties.

Configure the connector for my_database: In the my_database.properties file, add the necessary properties to configure the connector for your my_database. For example:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/my_database
connection-user=my_user
connection-password=my_password
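Since the three catalog files differ only in the database name, they can also be generated rather than typed by hand. The sketch below is one possible way to do that in Python. The helper names catalog_properties and write_catalog_files are hypothetical, and the defaults simply mirror the placeholder values used above.

```python
from pathlib import Path


def catalog_properties(database, host="localhost", user="my_user", password="my_password"):
    """Return the text of a Trino PostgreSQL catalog file for one database."""
    lines = [
        "connector.name=postgresql",
        f"connection-url=jdbc:postgresql://{host}/{database}",
        f"connection-user={user}",
        f"connection-password={password}",
    ]
    return "\n".join(lines) + "\n"


def write_catalog_files(catalog_dir, databases=("db1", "db2", "my_database"), **connection):
    """Write <database>.properties for each database and return the paths written."""
    catalog_dir = Path(catalog_dir)
    catalog_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for database in databases:
        path = catalog_dir / f"{database}.properties"
        path.write_text(catalog_properties(database, **connection))
        written.append(path)
    return written
```

For example, write_catalog_files("etc/catalog", password="...") run from the Trino installation directory would produce the same three files described above.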

Start Trino

Start Trino by running the launcher script in the bin directory:

./bin/launcher start

Check the status of the Trino server:

./bin/launcher status
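The launcher can report that the process is running before the server is ready to accept queries. As a rough sketch, the script below polls the coordinator until it reports that startup has finished. It assumes the coordinator exposes a /v1/info endpoint on the HTTP port configured above and that the response is JSON with a boolean "starting" field. The function names and the timeout values are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(info_payload: bytes) -> bool:
    """Return True only if the info payload explicitly says startup is complete."""
    info = json.loads(info_payload)
    return info.get("starting") is False


def wait_for_trino(base_url="http://localhost:8080", timeout_s=120.0, interval_s=2.0) -> bool:
    """Poll the coordinator until it is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/info", timeout=5) as response:
                if is_ready(response.read()):
                    return True
        except (urllib.error.URLError, OSError):
            # Server not listening yet; try again after a short pause.
            pass
        time.sleep(interval_s)
    return False
```

Calling wait_for_trino() after ./bin/launcher start returns True once the server is ready, or False if it does not come up within the timeout. In the latter case the server log under the data directory is the place to look.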

Install dbt

Create a Python virtual environment and activate it:

python3 -m venv my_venv
source my_venv/bin/activate

Install dbt-core, dbt-postgres, and the dbt-trino adapter:

pip install dbt-core
pip install dbt-postgres
pip install dbt-trino

Configure dbt

Create a profiles.yml file in the ~/.dbt directory with the following content (replace my_user and my_password with your actual Trino user and password):

my_project:
  target: dev
  outputs:
    dev:
      type: trino
      host: localhost
      port: 8080
      user: my_user
      password: my_password
      database: my_database
      schema: public

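In your dbt project directory (run dbt init my_project first if you have not created one; the profile name in dbt_project.yml must match my_project in profiles.yml), create a sources.yml file in the models directory with the following content: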

version: 2

sources:
  - name: db1
    database: db1
    schema: public
    tables:
      - name: orders

  - name: db2
    database: db2
    schema: public
    tables:
      - name: customers

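Create a dbt model that joins the orders table from db1 and the customers table from db2. The model is a .sql file in the models directory of your dbt project; its file name becomes the table name, so models/my_project_model.sql produces a table called my_project_model. Here’s an example: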

{{ config(materialized='table') }}

WITH orders AS (
    SELECT * FROM {{ source('db1', 'orders') }}
),
customers AS (
    SELECT * FROM {{ source('db2', 'customers') }}
)
SELECT orders.order_id, orders.product, orders.quantity, customers.first_name, customers.last_name, customers.email
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
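The cross-database join works because each source() call resolves to a three-part name (catalog, schema, table), and each Trino catalog points at a different PostgreSQL database. The sketch below mimics that lookup in plain Python to make the mapping explicit. It is an illustration only, not dbt's implementation: the SOURCES dictionary and the resolve_source function are hypothetical, and the real adapter may quote the identifiers.

```python
# Mirror of the sources.yml entries: source name -> catalog, schema, tables.
SOURCES = {
    "db1": {"database": "db1", "schema": "public", "tables": ["orders"]},
    "db2": {"database": "db2", "schema": "public", "tables": ["customers"]},
}


def resolve_source(source_name, table_name, sources=SOURCES):
    """Return the catalog.schema.table name a source reference points at."""
    entry = sources[source_name]
    if table_name not in entry["tables"]:
        raise KeyError(f"table {table_name!r} is not declared in source {source_name!r}")
    return f'{entry["database"]}.{entry["schema"]}.{table_name}'
```

Here resolve_source("db1", "orders") gives db1.public.orders and resolve_source("db2", "customers") gives db2.public.customers, so the compiled query reads from two catalogs while the result is written to the catalog named in profiles.yml.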

Run dbt

Finally, you can run your dbt project:

dbt run

This will create a new table in the my_database database that joins the orders table from db1 and the customers table from db2.

Verify the Result

You can verify the result by querying the my_project_model table in the public schema of the my_database database directly in PostgreSQL.
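The same check can also go through Trino, which confirms that the my_database catalog sees the new table. The sketch below assumes the trino Python client is available in the virtual environment and that the server accepts unauthenticated connections on localhost:8080, as configured in this guide. The function names are hypothetical.

```python
def verification_query(catalog="my_database", schema="public", table="my_project_model"):
    """Build the query used to read back the model table."""
    return f"SELECT * FROM {catalog}.{schema}.{table} ORDER BY order_id"


def fetch_model_rows(host="localhost", port=8080, user="my_user"):
    """Run the verification query through Trino and return all rows."""
    # Imported lazily so the module can be loaded without the client installed.
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user)
    cursor = conn.cursor()
    cursor.execute(verification_query())
    return cursor.fetchall()
```

With the sample data from this guide, fetch_model_rows() should return five rows, one per order.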

