Learn How to Configure Trino with Hudi and Hive Metastore with MinIO Object Store: Developer Guide

In the realm of big data processing, efficient data storage and querying are paramount. Technologies like Trino, Apache Hudi, and Hive Metastore play pivotal roles in achieving seamless data handling at scale. In this guide, we'll walk through the process of configuring Trino with Hudi and Hive Metastore while leveraging MinIO Object Store for storage.

Video Guide

https://www.youtube.com/watch?v=gfU5_WEX1cM&feature=youtu.be


Step 1: Setting up the Environment

We'll begin by defining our environment using Docker Compose. Below is a sample docker-compose.yml file:
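A minimal sketch of such a file is shown below. The image tags, ports, credentials, and service names are illustrative assumptions, not the exact values used in the video:

```yaml
version: "3.7"

services:
  minio:
    image: minio/minio
    environment:
      MINIO_ROOT_USER: admin          # assumed credential
      MINIO_ROOT_PASSWORD: password   # assumed credential
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    command: server /data --console-address ":9001"

  hive-metastore:
    image: apache/hive:3.1.3   # assumption: any image that can run a standalone metastore
    environment:
      SERVICE_NAME: metastore
    ports:
      - "9083:9083"   # thrift endpoint used by Hudi and Trino
    depends_on:
      - minio

  trino:
    image: trinodb/trino:390   # assumption: pin an older release (see the note at the end)
    ports:
      - "8080:8080"
    volumes:
      - ./trino/etc:/etc/trino   # mounts the config files described in Step 2
    depends_on:
      - hive-metastore
```

The key design point is that both the Spark writer and Trino talk to the same Hive Metastore thrift endpoint, while the data itself lives in MinIO.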

Step 2: Configuring Trino and Hudi

After setting up the environment, we need to configure Trino, Hudi, and Hive Metastore. Here are the configuration files and their explanations:

trino/etc/node.properties

This file specifies Trino node properties, including environment setup, data directory, and plugin directory.
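A sketch of what this file typically contains (the environment name and directories are assumptions for a Docker setup):

```properties
node.environment=docker
node.data-dir=/data/trino
plugin.dir=/usr/lib/trino/plugin
```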

trino/etc/jvm.config

This configures JVM options for Trino, optimizing memory usage and garbage collection.
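A sketch using commonly recommended Trino JVM flags (the heap size is an assumption; size it to your machine):

```
-server
-Xmx4G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
```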

trino/etc/config.properties

This sets up Trino as a coordinator node, specifies HTTP server port, and enables service discovery.
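For a single-node setup, the file usually looks like this (the port matches the docker-compose mapping; `node-scheduler.include-coordinator=true` lets the coordinator also act as a worker):

```properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://localhost:8080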

trino/etc/catalog/hudi.properties

This configures the Hudi connector with the Hive Metastore URI, MINIO Object Store credentials, and endpoint details.
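A hedged sketch of this catalog file, assuming the service names and credentials from the docker-compose sketch above:

```properties
connector.name=hudi
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=admin
hive.s3.aws-secret-key=password
hive.s3.path-style-access=true
```

Path-style access is needed because MinIO does not use virtual-hosted bucket addressing by default.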

Step 3: Sample Code Execution

Create a Spark session
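A minimal sketch of the session setup, assuming Spark 3.3 with the Hudi 0.13 bundle and MinIO reached over the S3A filesystem; the package versions, endpoint, and credentials are illustrative assumptions:

```python
# Spark configuration for writing Hudi tables to MinIO via S3A.
SPARK_CONF = {
    "spark.jars.packages": (
        "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1,"
        "org.apache.hadoop:hadoop-aws:3.3.2"
    ),
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    # MinIO is addressed through the S3A filesystem:
    "spark.hadoop.fs.s3a.endpoint": "http://minio:9000",
    "spark.hadoop.fs.s3a.access.key": "admin",      # assumed credential
    "spark.hadoop.fs.s3a.secret.key": "password",   # assumed credential
    "spark.hadoop.fs.s3a.path.style.access": "true",
}


def build_spark_session(app_name="hudi-minio-demo"):
    """Build a SparkSession with the Hudi/MinIO configuration above."""
    # Import inside the function so the sketch loads even without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```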

Use the following Hudi properties to enable Hive sync
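These are the standard Hudi Hive-sync write options; syncing in `hms` mode registers the table directly in the Hive Metastore so Trino can discover it. The database, table name, and thrift URI below are assumptions:

```python
# Hudi write options that register the table in the Hive Metastore (HMS).
HIVE_SYNC_OPTIONS = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.metastore.uris": "thrift://hive-metastore:9083",
    "hoodie.datasource.hive_sync.database": "default",   # assumed database
    "hoodie.datasource.hive_sync.table": "customers",    # assumed table name
}
```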


Write data into Hudi
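A sketch of the upsert into MinIO; the bucket path, table name, record key, and precombine field are illustrative assumptions for your own DataFrame schema:

```python
# Core Hudi write options: record key for upserts, precombine field for
# resolving duplicate keys, and the write operation itself.
HUDI_WRITE_OPTIONS = {
    "hoodie.table.name": "customers",                        # assumed table name
    "hoodie.datasource.write.recordkey.field": "customer_id",  # assumed key column
    "hoodie.datasource.write.precombine.field": "ts",          # assumed timestamp column
    "hoodie.datasource.write.operation": "upsert",
}


def write_to_hudi(spark_df, path="s3a://huditest/customers", extra_options=None):
    """Upsert a Spark DataFrame into a Hudi table stored in MinIO.

    Pass HIVE_SYNC_OPTIONS (or similar) as extra_options so the table is
    also registered in the Hive Metastore for Trino.
    """
    options = dict(HUDI_WRITE_OPTIONS, **(extra_options or {}))
    (spark_df.write.format("hudi")
        .options(**options)
        .mode("append")
        .save(path))
```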

Query Via Trino
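Once the table is synced, it can be queried through the `hudi` catalog defined earlier. The table and schema names below are assumptions matching the write sketch:

```sql
-- Connect with the Trino CLI, for example:
--   trino --server http://localhost:8080 --catalog hudi --schema default
SHOW TABLES;
SELECT * FROM customers LIMIT 10;
```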

Output:


These queries demonstrate how to connect to Trino and execute SQL commands to interact with the data stored in Hudi.

With these steps, you've successfully configured Trino with Hudi and Hive Metastore using MINIO Object Store, enabling seamless big data processing and querying capabilities.

GitHub: https://github.com/soumilshah1995?tab=repositories

Note: Newer Trino releases ship significant changes, including an updated Java version requirement, and currently exhibit a few bugs when querying Hudi data. Until those issues are resolved, it's recommended to stay on an older release.
Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & AWS Glue| Data Lake(Hudi | Iceberg) Specialist | YouTuber
