Multi-database Joins with Trino and dbt

This guide provides step-by-step instructions on how to use Trino and dbt to perform joins across multiple databases and create models in a third database.

Overview

Trino is a highly performant distributed SQL query engine, designed to query large data sets distributed over one or more heterogeneous data sources. With Trino, you can query data where it lives, including Hive, Cassandra, relational databases, and even proprietary data stores.

dbt (data build tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

In this guide, we will use Trino and dbt to join tables from two PostgreSQL databases (db1 and db2), and create a model in a third PostgreSQL database (my_database). The model will be a table that joins the orders table from db1 and the customers table from db2.

Prerequisites

Ensure that you have the following installed on your machine:

  • Python 3.10.12
  • PostgreSQL databases instance
  • A working linux environment

Here are the SQL scripts to create the databases, user, tables, and grant privileges. Please replace my_password with your actual password.

Create User:

CREATE USER my_user WITH PASSWORD 'my_password';
        

Create Databases:

CREATE DATABASE db1;
CREATE DATABASE db2;
CREATE DATABASE my_database;
        

Grant Privileges to User:

GRANT ALL PRIVILEGES ON DATABASE db1 TO my_user;
GRANT ALL PRIVILEGES ON DATABASE db2 TO my_user;
GRANT ALL PRIVILEGES ON DATABASE my_database TO my_user;
        

Create Tables:

Connect to db1 and create the orders table:

\c db1

CREATE TABLE public.orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product TEXT,
    quantity INT
);

INSERT INTO public.orders (order_id, customer_id, product, quantity) VALUES
(1, 101, 'Apples', 5),
(2, 102, 'Oranges', 10),
(3, 103, 'Pears', 8),
(4, 104, 'Grapes', 15),
(5, 105, 'Bananas', 12);
        

Connect to db2 and create the customers table:

\c db2

CREATE TABLE public.customers (
    customer_id INT PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    email TEXT
);

INSERT INTO public.customers (customer_id, first_name, last_name, email) VALUES
(101, 'John', 'Doe', '[email protected]'),
(102, 'Jane', 'Doe', '[email protected]'),
(103, 'Jim', 'Brown', '[email protected]'),
(104, 'Jake', 'Smith', '[email protected]'),
(105, 'Jill', 'Johnson', '[email protected]');
        

Please note that these scripts should be run by a PostgreSQL superuser or a user with the necessary privileges. Also, remember to replace 'my_password' with the actual password for my_user.

Install Java

Trino requires Java 21. If it’s not already installed, you can install it using the following commands:

sudo apt update
sudo apt install openjdk-21-jdk
        

Verify the installation by running java -version. The output should indicate that you’re running Java 21.

Install Trino

  1. Download Trino 435 from the link https://repo1.maven.org/maven2/io/trino/trino-server/435/.
  2. Extract the tarball using the command tar -xvf trino-435.tar.gz.
  3. Navigate to the extracted directory cd trino-435.

Configure Trino

Open the etc/config.properties file and ensure it has the following content:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=https://localhost:8080
        

Open the etc/jvm.config file and ensure it has the following content:

-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
        

Open the etc/node.properties file and ensure it has the following content:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/sush03e1/trino-server-435/var
        

Create a db1.properties, db2.properties, and my_database.properties file in the etc/catalog directory for each of your PostgreSQL databases (db1, db2, and my_database). Each file should have the following content (replace localhost, my_user, and my_password with your actual PostgreSQL host, user, and password):

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/my_database
connection-user=my_user
connection-password=my_password
        

Create a db1.properties file: In the etc/catalog directory of your Trino installation, create a new file named db1.properties.

Configure the connector for db1: In the db1.properties file, add the necessary properties to configure the connector for your db1 database. For example:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/db1
connection-user=my_user
connection-password=my_password
        

Create a db2.properties file: In the etc/catalog directory of your Trino installation, create a new file named db2.properties.

Configure the connector for db2: In the db2.properties file, add the necessary properties to configure the connector for your db2 database. For example:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/db2
connection-user=my_user
connection-password=my_password
        

Create a my_database.properties file: In the etc/catalog directory of your Trino installation, create a new file named my_database.properties.

Configure the connector for my_database: In the my_database.properties file, add the necessary properties to configure the connector for your my_database. For example:

connector.name=postgresql
connection-url=jdbc:postgresql://localhost/my_database
connection-user=my_user
connection-password=my_password
        

Start Trino

Start Trino by running the launcher script in the bin directory:

./bin/launcher start
        

Check the status of the Trino server:

./bin/launcher status
        

Install dbt

Create a Python virtual environment and activate it:

python3 -m venv my_venv
source my_venv/bin/activate
        

Install dbt-core dbt-postgres and the dbt-trino adapter:

pip install dbt-core
pip install dbt-postgres
pip install dbt dbt-trino
        

Configure dbt

Create a profiles.yml file in the ~/.dbt directory with the following content (replace my_user and my_password with your actual Trino user and password):

my_project:
  target: dev
  outputs:
    dev:
      type: trino
      host: localhost
      port: 8080
      user: my_user
      password: my_password
      database: my_database
      schema: public
        

In your dbt project directory, create a sources.yml file in the models directory with the following content:

version: 2

sources:
  - name: db1
    database: db1
    schema: public
    tables:
      - name: orders

  - name: db2
    database: db2
    schema: public
    tables:
      - name: customers
        

Create a dbt model that joins the orders table from db1 and the customers table from db2. The model should be a .sql file in the models directory of your dbt project. Here’s an example:

{{ config(materialized='table') }}

WITH orders AS (
    SELECT * FROM {{ source('db1', 'orders') }}
),
customers AS (
    SELECT * FROM {{ source('db2', 'customers') }}
)
SELECT orders.order_id, orders.product, orders.quantity, customers.first_name, customers.last_name, customers.email
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
        

Run dbt

Finally, you can run your dbt project:

dbt run
        

This will create a new table in the my_database database that joins the orders table from db1 and the customers table from db2.

Verify the Result

You can verify the result by querying the my_project_model table in the my_database database using postgres.


Igor Martins

Senior Data Engineer | Data Architect | Databricks | AWS | Airflow | Spark | PySpark | SQL | Python | Pipelines | CI/CD | Docker | Terraform

9 个月

Thanks for this post! helped me

Alfonso Paz Luna

Chief Product Owner @ SEB

9 个月

Thanks Sulfi Bashy for this post! It's a great guide! Extremely useful!!

要查看或添加评论,请登录

Sulfikkar Shylaja的更多文章

社区洞察

其他会员也浏览了