Multi-database Joins with Trino and dbt
Sulfikkar Shylaja
Senior Data Engineer | Data Architect & Lead | Transforming Complex Data into Impactful Insights
This guide provides step-by-step instructions on how to use Trino and dbt to perform joins across multiple databases and create models in a third database.
Overview
Trino is a highly performant distributed SQL query engine, designed to query large data sets distributed over one or more heterogeneous data sources. With Trino, you can query data where it lives, including Hive, Cassandra, relational databases, and even proprietary data stores.
dbt (data build tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.
In this guide, we will use Trino and dbt to join tables from two PostgreSQL databases (db1 and db2), and create a model in a third PostgreSQL database (my_database). The model will be a table that joins the orders table from db1 and the customers table from db2.
Prerequisites
Ensure that you have the following installed on your machine:
Here are the SQL scripts to create the databases, user, tables, and grant privileges. Please replace my_password with your actual password.
Create User:
CREATE USER my_user WITH PASSWORD 'my_password';
Create Databases:
CREATE DATABASE db1;
CREATE DATABASE db2;
CREATE DATABASE my_database;
Grant Privileges to User:
GRANT ALL PRIVILEGES ON DATABASE db1 TO my_user;
GRANT ALL PRIVILEGES ON DATABASE db2 TO my_user;
GRANT ALL PRIVILEGES ON DATABASE my_database TO my_user;
Create Tables:
Connect to db1 and create the orders table:
\c db1
CREATE TABLE public.orders (
order_id INT PRIMARY KEY,
customer_id INT,
product TEXT,
quantity INT
);
INSERT INTO public.orders (order_id, customer_id, product, quantity) VALUES
(1, 101, 'Apples', 5),
(2, 102, 'Oranges', 10),
(3, 103, 'Pears', 8),
(4, 104, 'Grapes', 15),
(5, 105, 'Bananas', 12);
Connect to db2 and create the customers table:
\c db2
CREATE TABLE public.customers (
customer_id INT PRIMARY KEY,
first_name TEXT,
last_name TEXT,
email TEXT
);
INSERT INTO public.customers (customer_id, first_name, last_name, email) VALUES
(101, 'John', 'Doe', '[email protected]'),
(102, 'Jane', 'Doe', '[email protected]'),
(103, 'Jim', 'Brown', '[email protected]'),
(104, 'Jake', 'Smith', '[email protected]'),
(105, 'Jill', 'Johnson', '[email protected]');
Please note that these scripts should be run by a PostgreSQL superuser or a user with the necessary privileges. Also, remember to replace 'my_password' with the actual password for my_user.
Install Java
Trino requires Java 21. If it’s not already installed, you can install it using the following commands:
sudo apt update
sudo apt install openjdk-21-jdk
Verify the installation by running java -version. The output should indicate that you’re running Java 21.
Install Trino
Configure Trino
Open the etc/config.properties file and ensure it has the following content:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=https://localhost:8080
Open the etc/jvm.config file and ensure it has the following content:
-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
Open the etc/node.properties file and ensure it has the following content:
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/sush03e1/trino-server-435/var
Create a db1.properties, db2.properties, and my_database.properties file in the etc/catalog directory for each of your PostgreSQL databases (db1, db2, and my_database). Each file should have the following content (replace localhost, my_user, and my_password with your actual PostgreSQL host, user, and password):
领英推荐
connector.name=postgresql
connection-url=jdbc:postgresql://localhost/my_database
connection-user=my_user
connection-password=my_password
Create a db1.properties file: In the etc/catalog directory of your Trino installation, create a new file named db1.properties.
Configure the connector for db1: In the db1.properties file, add the necessary properties to configure the connector for your db1 database. For example:
connector.name=postgresql
connection-url=jdbc:postgresql://localhost/db1
connection-user=my_user
connection-password=my_password
Create a db2.properties file: In the etc/catalog directory of your Trino installation, create a new file named db2.properties.
Configure the connector for db2: In the db2.properties file, add the necessary properties to configure the connector for your db2 database. For example:
connector.name=postgresql
connection-url=jdbc:postgresql://localhost/db2
connection-user=my_user
connection-password=my_password
Create a my_database.properties file: In the etc/catalog directory of your Trino installation, create a new file named my_database.properties.
Configure the connector for my_database: In the my_database.properties file, add the necessary properties to configure the connector for your my_database. For example:
connector.name=postgresql
connection-url=jdbc:postgresql://localhost/my_database
connection-user=my_user
connection-password=my_password
Start Trino
Start Trino by running the launcher script in the bin directory:
./bin/launcher start
Check the status of the Trino server:
./bin/launcher status
Install dbt
Create a Python virtual environment and activate it:
python3 -m venv my_venv
source my_venv/bin/activate
Install dbt-core dbt-postgres and the dbt-trino adapter:
pip install dbt-core
pip install dbt-postgres
pip install dbt dbt-trino
Configure dbt
Create a profiles.yml file in the ~/.dbt directory with the following content (replace my_user and my_password with your actual Trino user and password):
my_project:
target: dev
outputs:
dev:
type: trino
host: localhost
port: 8080
user: my_user
password: my_password
database: my_database
schema: public
In your dbt project directory, create a sources.yml file in the models directory with the following content:
version: 2
sources:
- name: db1
database: db1
schema: public
tables:
- name: orders
- name: db2
database: db2
schema: public
tables:
- name: customers
Create a dbt model that joins the orders table from db1 and the customers table from db2. The model should be a .sql file in the models directory of your dbt project. Here’s an example:
{{ config(materialized='table') }}
WITH orders AS (
SELECT * FROM {{ source('db1', 'orders') }}
),
customers AS (
SELECT * FROM {{ source('db2', 'customers') }}
)
SELECT orders.order_id, orders.product, orders.quantity, customers.first_name, customers.last_name, customers.email
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
Run dbt
Finally, you can run your dbt project:
dbt run
This will create a new table in the my_database database that joins the orders table from db1 and the customers table from db2.
Verify the Result
You can verify the result by querying the my_project_model table in the my_database database using postgres.
Senior Data Engineer | Data Architect | Databricks | AWS | Airflow | Spark | PySpark | SQL | Python | Pipelines | CI/CD | Docker | Terraform
9 个月Thanks for this post! helped me
Chief Product Owner @ SEB
9 个月Thanks Sulfi Bashy for this post! It's a great guide! Extremely useful!!