Ingesting Data from Apache Pulsar Using Hudi Delta Streamer: A Step-by-Step Guide

Apache Pulsar and Apache Hudi are powerful tools for managing and processing streaming data. Pulsar is a cloud-native, distributed messaging and streaming platform, while Hudi is a data lake framework that brings upserts, deletes, and incremental processing to large datasets. Using Hudi's Delta Streamer, we can ingest data from Pulsar into a Hudi table without writing a custom ingestion job, which enables near-real-time analytics and efficient data management. This step-by-step guide walks you through setting up Pulsar, MinIO, and Hudi Delta Streamer to achieve this integration.

Video Guide

https://www.youtube.com/watch?v=I_br7DTEDO0&t=138s



Step 1: Spin Up Apache Pulsar

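First, we need to set up Apache Pulsar. Docker makes it simple to run Pulsar locally: for example, start a container from the official apachepulsar/pulsar image with bin/pulsar standalone as its command and ports 6650 and 8080 published to the host.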

This command will start Pulsar in standalone mode, making it accessible at localhost:6650 for client connections and localhost:8080 for administrative tasks.
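
Before moving on, it can help to confirm that both ports actually accept connections. The snippet below is a minimal sketch that only uses the Python standard library; the helper name port_open is made up for this example, and the host and ports are the ones mentioned above.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 6650 is the client (broker) port, 8080 the admin port.
    for name, port in [("broker", 6650), ("admin", 8080)]:
        status = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"Pulsar {name} port {port}: {status}")
```

An open port only shows that the container is listening. Pulsar standalone can still need a few more seconds before it accepts producers.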

Step 2: Set Up MinIO

MinIO is an S3-compatible object storage server that we will use to store our data. Here's the Docker Compose file to set up MinIO:

Save the above content to a file named docker-compose.yml and run:

docker-compose up -d
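
Once the container is up, you may want to check that MinIO is answering requests. The sketch below calls MinIO's liveness endpoint (/minio/health/live) using only the standard library. It assumes the S3 API is published on http://localhost:9000, which is MinIO's default port but may differ from what your docker-compose.yml maps. The function name minio_is_live is hypothetical.

```python
import urllib.error
import urllib.request


def minio_is_live(base_url: str = "http://localhost:9000", timeout: float = 3.0) -> bool:
    """Return True if MinIO's liveness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not running, wrong port, or a non-2xx response.
        return False


if __name__ == "__main__":
    # Assumed endpoint; change it if your compose file maps another port.
    print("MinIO live:", minio_is_live())
```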

Step 3: Publish Sample Messages to Pulsar

Next, we will publish some sample messages to Pulsar using a Python script. First, install the necessary Python libraries:

pip install pulsar-client fastavro

Now, create the following Python script to send messages to a Pulsar topic:
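
One possible shape for such a script is sketched below. It is not the author's original. The schema fields, the topic name my-topic, and the choice to send schemaless Avro bytes are all assumptions made for illustration, and the payload format your Hudi source expects may differ. The pulsar and fastavro calls used here (pulsar.Client, create_producer, send, fastavro.parse_schema, fastavro.schemaless_writer) are part of those libraries' public APIs.

```python
import io
import time

# Hypothetical Avro schema; the real fields should match gh_schema.avsc.
SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "message", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}


def build_record(event_id: str, message: str) -> dict:
    """Build one sample record whose keys match SCHEMA."""
    return {"id": event_id, "message": message, "ts": int(time.time() * 1000)}


def encode_avro(record: dict) -> bytes:
    """Serialize a record to schemaless Avro bytes with fastavro."""
    import fastavro  # imported lazily so build_record works without it

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, fastavro.parse_schema(SCHEMA), record)
    return buf.getvalue()


def publish(payload: bytes, topic: str,
            service_url: str = "pulsar://localhost:6650") -> None:
    """Send one message to a Pulsar topic, then close the client."""
    import pulsar  # provided by the pulsar-client package

    client = pulsar.Client(service_url)
    try:
        producer = client.create_producer(topic)
        producer.send(payload)
    finally:
        client.close()


if __name__ == "__main__":
    # "my-topic" is a placeholder topic name.
    publish(encode_avro(build_record("1", "hello from pulsar")), topic="my-topic")
```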

Run this script to publish a message to the Pulsar topic.

Step 4: Set Up Hudi Delta Streamer

Hudi Delta Streamer allows you to ingest data into Hudi in a stream-oriented fashion, which makes it perfect for handling continuous data streams from Pulsar. Let's configure and run Delta Streamer.

Configure Spark

Create a spark-config.properties file with the following content:
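
As a rough sketch of what such a file might contain, the snippet below writes a spark-config.properties that points Spark's S3A connector at a local MinIO instance and enables the Hudi session extension. The property keys are standard Spark, Hadoop S3A, and Hudi settings. The endpoint and the credential placeholders are assumptions, so substitute the values from your own MinIO setup.

```python
def render_properties(settings: dict) -> str:
    """Render key/value pairs as Java-style .properties lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Assumed values: adjust the endpoint and credentials to your MinIO setup.
SPARK_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.hadoop.fs.s3a.endpoint": "http://localhost:9000",
    "spark.hadoop.fs.s3a.access.key": "<minio-access-key>",
    "spark.hadoop.fs.s3a.secret.key": "<minio-secret-key>",
    # MinIO serves buckets as URL paths rather than subdomains.
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

if __name__ == "__main__":
    with open("spark-config.properties", "w", encoding="utf-8") as fh:
        fh.write(render_properties(SPARK_SETTINGS))
```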

Define Schema

Create a schema file gh_schema.avsc for the data:
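
An .avsc file is plain JSON describing an Avro record. The sketch below writes one with the standard library. The record name and the three fields are hypothetical placeholders, so replace them with the fields your Pulsar messages actually carry.

```python
import json

# Hypothetical record layout; replace with the fields of your messages.
GH_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "message", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}


def write_schema(path: str, schema: dict = GH_SCHEMA) -> None:
    """Write an Avro schema to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(schema, fh, indent=2)


if __name__ == "__main__":
    write_schema("gh_schema.avsc")
```

Whatever fields you choose, the record key and ordering field you pass to Delta Streamer need to exist in this schema.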

Run Delta Streamer

Finally, run the Delta Streamer using the following command:

This command runs Hudi Delta Streamer in continuous mode: instead of exiting after one batch, it keeps reading new messages from the specified Pulsar topic and writes them to the Hudi table.
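
To make the moving parts of such a command easier to see, here is a hedged sketch that assembles a spark-submit invocation as a Python list and prints it. It is an illustration, not the author's exact command. The jar path, bucket path, table name, topic, and field names are placeholders. The main class shown is the Delta Streamer entry point from older Hudi releases (newer releases also ship it as HoodieStreamer). The Pulsar topic key should be checked against the configuration reference for your Hudi version, and a real run also needs the Hudi utilities bundle and the S3A and Pulsar connector jars on the classpath.

```python
import shlex


def build_command(jar: str, base_path: str, table: str, topic: str,
                  schema_file: str, record_key: str, ordering_field: str) -> list:
    """Assemble a spark-submit argument list for Hudi Delta Streamer."""
    # Hudi settings passed with --hoodie-conf; verify the keys for your version.
    hoodie_conf = {
        "hoodie.deltastreamer.source.pulsar.topic": topic,
        "hoodie.deltastreamer.schemaprovider.source.schema.file": schema_file,
        "hoodie.datasource.write.recordkey.field": record_key,
    }
    cmd = [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "--properties-file", "spark-config.properties",
        jar,
        "--table-type", "COPY_ON_WRITE",
        "--op", "UPSERT",
        "--source-class", "org.apache.hudi.utilities.sources.PulsarSource",
        "--source-ordering-field", ordering_field,
        "--schemaprovider-class",
        "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
        "--target-base-path", base_path,
        "--target-table", table,
        "--continuous",  # keep running instead of exiting after one batch
    ]
    for key, value in hoodie_conf.items():
        cmd += ["--hoodie-conf", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    # Every argument below is a placeholder for illustration.
    print(shlex.join(build_command(
        jar="/path/to/hudi-utilities-bundle.jar",
        base_path="s3a://my-bucket/hudi/events",
        table="events",
        topic="my-topic",
        schema_file="gh_schema.avsc",
        record_key="id",
        ordering_field="ts",
    )))
```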

Conclusion

By following this guide, you've set up a pipeline that ingests data from Apache Pulsar into Hudi using Hudi Delta Streamer, with MinIO as the storage layer. Because Delta Streamer runs in continuous mode, the Hudi table is updated as new messages arrive on the topic, so your data lake stays current and ready for analytics without a hand-written ingestion job.


GH Repo

https://github.com/soumilshah1995/hudi-streamer-pulsar


If you have any questions, don't hesitate to post them. I will do my best to answer, and if I don't know, I'll probably learn and get back to you.
