Implementing Keyword Search in Hudi: Building Inverted Indexes with Record Level Index, Metadata Indexing and Point Lookups | Text search on data lake

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & AWS Glue| Data Lake(Hudi | Iceberg) Specialist | YouTuber

Published: July 14, 2024

In the world of big data, efficient and scalable search is crucial for deriving meaningful insights. Apache Hudi, with its data management and indexing capabilities, can serve as the basis for keyword search on large-scale data lakes. This blog walks through implementing keyword search with Apache Hudi, focusing on building inverted indexes, scaling to terabytes of text data, and leveraging record-level indexing, metadata indexing, and point lookups.

Video Guides : <To be added soon>

Understanding Apache Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework used to simplify incremental data processing and data pipeline development. Hudi provides support for managing large-scale data lakes on top of distributed storage systems like Amazon S3, HDFS, and Azure Blob Storage. Its primary features include support for upserts, incremental data processing, and a variety of indexing mechanisms to optimize data retrieval and queries.

Building the Foundation: The Bronze Documents Table

To implement keyword search, we first need a foundational table, bronze_documents, which will store the raw text data along with document IDs. This table will act as the base layer, holding the unprocessed textual data.

Building the Inverted Index

An inverted index is a data structure used to map content to its location within a dataset, making it a powerful tool for search functionalities. It allows for efficient keyword searches by maintaining a mapping between keywords and the documents they appear in.

To create an inverted index, we first tokenize the text, remove stop words, hash the keywords, and then build the index.

Define a UDF in Spark to create the inverted index.
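The steps above (tokenize, remove stop words, hash, build the index) can be sketched in plain Python. This is a minimal illustration rather than the lab's actual code: the stop-word list, the choice of MD5 as the hash, and the function names tokenize, hash_keyword, and build_inverted_index are all assumptions. In Spark, the same functions could be wrapped in a UDF and applied to the text column of bronze_documents.

```python
import hashlib
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def hash_keyword(keyword):
    """Return a stable hex digest for a keyword (MD5 is an assumed choice)."""
    return hashlib.md5(keyword.encode("utf-8")).hexdigest()


def build_inverted_index(documents):
    """Map each keyword hash to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in tokenize(text):
            index[hash_keyword(keyword)].add(doc_id)
    return {digest: sorted(ids) for digest, ids in index.items()}


# Hypothetical sample rows standing in for bronze_documents.
documents = {
    1: "Apache Hudi supports upserts on the data lake",
    2: "Record level index speeds up point lookups in Hudi",
}
inverted_index = build_inverted_index(documents)
```

Using the keyword hash as the key of the index gives every keyword a fixed-width identifier, which is what later allows a lookup by exact key instead of a text scan.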

Leveraging Record-Level Indexing and Metadata Indexing

Record-level indexing in Hudi allows for efficient point lookups and fast retrieval of specific records based on a unique identifier. This feature, combined with metadata indexing, enhances query performance and scalability, particularly when dealing with large datasets.

To enable these features, we configure Hudi options accordingly when writing data to the tables.
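As a rough sketch, the write options might look like the dictionary below. The hoodie.* keys are Hudi configuration names (the record-level index is available from Hudi 0.14 onward); the helper hudi_write_options and the table and column names (doc_id, keyword_hash, ts) are hypothetical, so check the values against the Hudi version you run.

```python
def hudi_write_options(table_name, record_key, precombine_field):
    """Build Hudi write options with the metadata table and record-level index enabled."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        # Metadata table plus record-level index, used for fast key lookups.
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.record.index.enable": "true",
        "hoodie.index.type": "RECORD_INDEX",
    }


# Assumed column names: doc_id keys the documents, keyword_hash keys the index.
bronze_options = hudi_write_options("bronze_documents", "doc_id", "ts")
index_options = hudi_write_options("inverted_index", "keyword_hash", "ts")
```

With PySpark, a dictionary like this would typically be passed as df.write.format("hudi").options(**bronze_options).mode("append").save(path), where path is the table location.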

Performing Keyword Search

With the inverted index in place, performing keyword searches becomes straightforward. We hash the search keywords, look up the corresponding document IDs in the inverted index, and then fetch the relevant documents from the bronze_documents table.

Example: Searching with a Sentence

To search using a sentence, we split the sentence into individual words, remove stop words, hash the keywords, and perform a lookup in the inverted index to fetch the relevant documents.

Lookups are fast because both tables use record-level indexing.
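The sentence search can be sketched end to end with in-memory dictionaries standing in for the two Hudi tables. Everything here is illustrative: the stop-word list, the MD5 hash, the sample documents, and the names keyword_hashes and search are assumptions, and in the real pipeline each dictionary access would be a key lookup against a Hudi table.

```python
import hashlib
import re

# Assumption: a small illustrative stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "what", "with"}


def keyword_hashes(sentence):
    """Split a sentence into keywords, drop stop words, and hash each keyword."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return {
        w: hashlib.md5(w.encode("utf-8")).hexdigest()
        for w in words
        if w not in STOP_WORDS
    }


def search(sentence, inverted_index, bronze_documents):
    """Return {doc_id: text} for every document matching any keyword in the sentence."""
    doc_ids = set()
    for digest in keyword_hashes(sentence).values():
        doc_ids.update(inverted_index.get(digest, []))
    return {doc_id: bronze_documents[doc_id] for doc_id in sorted(doc_ids)}


def _h(word):
    return hashlib.md5(word.encode("utf-8")).hexdigest()


# Hypothetical stand-ins for the bronze_documents and inverted index tables.
bronze_documents = {
    2: "Hudi supports upserts",
    4: "The record level index in Hudi",
    5: "Iceberg table format",
}
inverted_index = {_h("hudi"): [2, 4], _h("index"): [4], _h("iceberg"): [5]}

results = search("What is the Hudi index", inverted_index, bronze_documents)
```

This version returns a document if it matches any keyword (OR semantics); intersecting the ID sets instead would require every keyword to match.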

The user types a search sentence.

We remove the stop words and take the remaining keywords and their respective hash values.

Next, we query the inverted index and collect the matching document IDs. The lookup is fast because of the record-level index (see "Record Level Indexing in Apache Hudi Delivers 70% Faster Point Lookups": https://www.dhirubhai.net/pulse/record-level-indexing-apache-hudi-delivers-70-faster-point-shah-hlite/).

From this, I know that documents 4 and 2 contain these keywords. Now I can query the main documents table by primary key and fetch the documents.
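A small helper can turn the matched document IDs into a point-lookup query on the record key. This is only a sketch: the function name build_point_lookup_sql and the column name doc_id are hypothetical, and the query text would be run through Spark SQL against the Hudi table. Filtering on the record key with an equality or IN predicate is the shape of query the record-level index is meant to speed up.

```python
def build_point_lookup_sql(table, key_column, doc_ids):
    """Build a SELECT that fetches documents by record key; None if there are no IDs."""
    # Casting to int keeps arbitrary strings out of the generated SQL.
    ids = sorted({int(doc_id) for doc_id in doc_ids})
    if not ids:
        return None
    id_list = ", ".join(str(i) for i in ids)
    return f"SELECT * FROM {table} WHERE {key_column} IN ({id_list})"


# Document IDs returned by the inverted index lookup in the walkthrough.
query = build_point_lookup_sql("bronze_documents", "doc_id", [4, 2])
```

In a PySpark session the resulting string could be passed to spark.sql(query) once the table is registered.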

Output

Lookups are very fast because we use the keyword hash and the record-level index to find the matching documents in the inverted index, and then query the document store in Hudi directly by key.

Labs:

https://soumilshah1995.blogspot.com/2024/07/implementing-keyword-search-in-apache.html

This is a great starting point for anyone looking to perform text or keyword searches on data lakes. Feel free to improve and expand upon it, using this as a foundation. You can also incorporate ranking and filtering to return records based on relevance. I hope you find it helpful and enjoy working with it.
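One simple way to add ranking, offered as a sketch and not as part of the lab, is to order documents by how many of the query keywords they contain. The function name rank_documents and the placeholder hash strings are hypothetical.

```python
from collections import Counter


def rank_documents(keyword_hashes, inverted_index):
    """Rank document IDs by the number of query keywords each one matches."""
    counts = Counter()
    for digest in keyword_hashes:
        for doc_id in inverted_index.get(digest, []):
            counts[doc_id] += 1
    # Highest match count first; ties are broken by ascending document ID.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder hashes stand in for real keyword digests.
inverted_index = {"h_hudi": [2, 4], "h_index": [4], "h_lake": [1, 4]}
ranked = rank_documents(["h_hudi", "h_index", "h_lake"], inverted_index)
```

Here document 4 ranks first because it matches all three keywords; a relevance score such as TF-IDF could replace the plain count.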

Conclusion

By leveraging Apache Hudi's advanced indexing capabilities, we can build efficient and scalable keyword search functionalities for large-scale data lakes. The combination of inverted indexing, record-level indexing, and metadata indexing ensures fast lookups and high query performance, making it a robust solution for handling terabytes of text data. With these techniques, you can enhance your data processing workflows and enable powerful search capabilities in your big data environment.

References

https://www.dhirubhai.net/pulse/record-level-indexing-apache-hudi-delivers-70-faster-point-shah-hlite/

https://soumilshah1995.blogspot.com/2024/01/learn-how-to-use-apache-hudi-data.html
