Hudi Best Practices: Handling Failed Inserts/Upserts with Error Tables



Introduction:

In today's data-driven world, ensuring data integrity is of utmost importance, especially in scenarios involving financial data, healthcare records, or other sensitive information. Apache Hudi, an open-source transactional data lake framework, offers robust features for handling failed inserts and upserts efficiently. In this article, we will delve into best practices for using Hudi's pre-commit validators and error tables to handle failed operations, and discuss the advantages of incorporating these practices into critical data workflows.

Understanding the Pre-Commit Validator in Hudi:

Apache Hudi's pre-commit validator is a powerful mechanism that allows users to define and execute validation rules on incoming data before committing it to the target table. By leveraging the pre-commit validator, data engineers can enforce business rules, data quality checks, and schema validation, ensuring that only valid and compliant records are processed further. This helps in preventing the insertion of erroneous data into the main dataset, thereby safeguarding data integrity.
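As a concrete illustration, here is a minimal PySpark sketch of attaching one of Hudi's built-in SQL pre-commit validators to a write. The table name, key fields, storage path, and the amount column are illustrative assumptions; the validator class and configuration keys come from Hudi's pre-commit validator framework.

```python
# Minimal sketch: wiring a Hudi SQL pre-commit validator into a PySpark
# upsert. Assumes "df" is the incoming batch as a Spark DataFrame.
hudi_options = {
    "hoodie.table.name": "transactions",
    "hoodie.datasource.write.recordkey.field": "txn_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Validators run after the data is written but before the commit is
    # finalized; a validation failure rolls the commit back, so no bad
    # data ever becomes visible to readers.
    "hoodie.precommit.validators":
        "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator",
    # The query result must be identical before and after the commit:
    # here, the count of negative-amount rows must not grow.
    "hoodie.precommit.validators.equality.sql.queries":
        "select count(*) from <TABLE_NAME> where amount < 0",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/transactions")  # illustrative path
)
```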

Leveraging Error Tables for Failed Inserts/Upserts:

Handling failed inserts or upserts can be a challenging task, especially when dealing with large volumes of data. Error tables offer an elegant solution to this problem. When a validation rule fails during an insert or upsert operation, the erroneous records are automatically diverted to an error table instead of being added to the main dataset. This isolation of failed records simplifies the error resolution process and allows data analysts to investigate and correct the problematic data separately. For instance, in a financial application, if a transaction fails to meet certain compliance criteria, it can be directed to an error table for further analysis.
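Note that Hudi itself rolls back the failed commit; routing the rejected batch to an error table is a pattern you implement in the writer. A minimal sketch, with all function, option, and path names as illustrative assumptions:

```python
# Sketch of the divert-on-failure pattern: try the guarded upsert, and if
# the pre-commit validator rejects the commit, land the batch in a
# separate error Hudi table instead.
def upsert_with_error_table(df, main_opts, main_path, error_opts, error_path):
    try:
        df.write.format("hudi").options(**main_opts).mode("append").save(main_path)
    except Exception:
        # Hudi has already rolled the failed commit back at this point;
        # the batch is preserved in the error table for later analysis.
        # (Catching Exception broadly for brevity; a production job
        # should match the specific validation error.)
        df.write.format("hudi").options(**error_opts).mode("append").save(error_path)
```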



Benefits of Using Apache Hudi's Pre-Commit Validator and Error Tables:

Incorporating Apache Hudi's pre-commit validator and error tables brings several key advantages to data processing pipelines:

  1. Enhanced Data Integrity: By enforcing data validation rules with the pre-commit validator, organizations can ensure that only high-quality and reliable data is stored in the main table, reducing the risk of data corruption.
  2. Efficient Troubleshooting: Error tables offer a centralized location to capture failed records, making it easier to identify and rectify data anomalies. This streamlined troubleshooting process enhances data accuracy and accelerates issue resolution.
  3. Proactive Data Quality Management: The use of error tables enables data teams to monitor the incoming data for anomalies, empowering them to take proactive measures to improve overall data quality.
  4. Incremental Data Correction: Error tables facilitate incremental iterations over the problematic data, allowing for iterative data correction and backfilling, leading to continuous improvements in data quality over time.


Video-Based Guides


Lab Exercise: Step-by-Step Implementation:


Let's say we have a column called message in the data, and we want to enforce a rule: any time message is null, that batch should not be committed to the Hudi table and should instead be moved to an error table.
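The exact configuration used in the demo lives in the linked repository below. As a sketch, such a rule can be expressed with Hudi's single-result pre-commit validator, where the value after the # is what the query must return for the commit to succeed:

```python
# Sketch of the null-message rule: the commit succeeds only if the query
# returns the expected value after '#', i.e. zero rows with a null message.
validator_options = {
    "hoodie.precommit.validators":
        "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator",
    "hoodie.precommit.validators.single.value.sql.queries":
        "select count(*) from <TABLE_NAME> where message is null#0",
}
```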

Download the JAR

https://drive.google.com/file/d/1iSqNyj2k6WvNSwUGWblEnr_AQFlLJVzC/view


This data will be upserted into the Hudi table.

Let's try a batch where message is null.
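For instance, a hypothetical batch like this (assuming an active SparkSession named spark) should trip the rule:

```python
# Hypothetical batch with a null message, which should fail validation.
bad_batch = spark.createDataFrame(
    [("e-101", "2024-01-01 00:00:00", None)],
    ["event_id", "ts", "message"],
)
```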

No alt text provided for this image

Since the validation rule checks whether message is NULL, these records should be moved into the error Hudi table.


Results

Error Tables


Events table


CODE

https://github.com/soumilshah1995/-Hudi-Best-Practices-Handling-Failed-Inserts-Upserts-with-Error-Tables/blob/main/demo.py


NOTE

Error Hudi tables give you enhanced visibility into which items failed, when they failed, and why. Moreover, you can subsequently iterate over the error tables incrementally and, if necessary, backfill your main tables. This enables you to gain deeper insights into the errors, troubleshoot effectively, and make the necessary improvements to your processes.
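As a sketch of that incremental iteration, the error table can be read with Hudi's incremental query type, repaired, and upserted back into the main table. Here last_processed_commit, err_path, main_options, and main_path are assumptions drawn from your job's checkpointing and the write configuration shown earlier:

```python
# Incrementally fetch only the error records that arrived after the last
# run, repair them, and backfill the main table.
errors_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_processed_commit)
    .load(err_path)
)

# Drop Hudi's metadata columns before rewriting; they are regenerated
# on the next commit.
clean_df = errors_df.drop(*[c for c in errors_df.columns if c.startswith("_hoodie")])

# Example repair: fill the offending null messages, then upsert the
# corrected records into the main table.
repaired_df = clean_df.fillna({"message": "unknown"})
(
    repaired_df.write.format("hudi")
    .options(**main_options)
    .mode("append")
    .save(main_path)
)
```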


Conclusion

Incorporating Apache Hudi's pre-commit validator and error tables into your data processing pipelines is a strategic move towards ensuring data integrity and improving data quality. By handling failed inserts or upserts effectively, data teams can maintain a high standard of data accuracy and reliability, making Apache Hudi an invaluable asset in the modern data landscape. Empower your organization with these best practices to drive successful data-driven initiatives with confidence.


