Hudi Best Practices: Handling Failed Inserts/Upserts with Error Tables
Introduction:
In today's data-driven world, ensuring data integrity is of utmost importance, especially in scenarios involving financial data, healthcare records, or other sensitive information. Apache Hudi, an open-source transactional data lake platform, offers robust features to handle failed inserts or upserts efficiently. In this article, we will delve into the best practices of using Hudi's pre-commit validator and error tables to handle failed operations, and discuss the advantages of incorporating these practices into critical data workflows.
Understanding the Pre-Commit Validator in Hudi:
Apache Hudi's pre-commit validator is a powerful mechanism that allows users to define and execute validation rules on incoming data before committing it to the target table. By leveraging the pre-commit validator, data engineers can enforce business rules, data quality checks, and schema validation, ensuring that only valid and compliant records are processed further. This helps in preventing the insertion of erroneous data into the main dataset, thereby safeguarding data integrity.
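For instance, validators are attached through the write options of a Hudi write. The sketch below assumes PySpark with the Hudi Spark bundle on the classpath; the validator class and configuration keys follow the Hudi documentation, but verify them against your Hudi version, and note that the amount column is a hypothetical example:

```python
# A minimal sketch: run a SQL equality check before each commit is finalized.
# The "amount" column is hypothetical; <TABLE_NAME> is a literal placeholder
# that Hudi substitutes at validation time.
validator_options = {
    # Comma-separated list of validator classes to execute pre-commit.
    "hoodie.precommit.validators":
        "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator",
    # The query must return the same result before and after the write,
    # i.e. the commit must not introduce any negative amounts.
    "hoodie.precommit.validators.equality.sql.queries":
        "select count(*) from <TABLE_NAME> where amount < 0",
}
# Passed together with the usual Hudi write options, e.g.:
#   df.write.format("hudi").options(**base_options).options(**validator_options) \
#     .mode("append").save(table_path)
```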
Leveraging Error Tables for Failed Inserts/Upserts:
Handling failed inserts or upserts can be a challenging task, especially when dealing with large volumes of data. Error tables offer an elegant solution to this problem. When a validation rule fails during an insert or upsert operation, the erroneous records are automatically diverted to an error table instead of being added to the main dataset. This isolation of failed records simplifies the error resolution process and allows data analysts to investigate and correct the problematic data separately. For instance, in a financial application, if a transaction fails to meet certain compliance criteria, it can be directed to an error table for further analysis.
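Depending on the Hudi version, this routing is available as a built-in error-table feature or is implemented at the application level. The sketch below shows the application-level pattern, with illustrative table names and paths: if pre-commit validation rejects the commit, the whole batch is written to a separate Hudi error table, enriched with failure context.

```python
from pyspark.sql import functions as F

def upsert_with_error_table(df, hudi_options, events_path, error_path):
    """Try to commit a batch; if validation rejects it, divert it to an error table."""
    try:
        (df.write.format("hudi")
           .options(**hudi_options)  # includes the pre-commit validator configuration
           .mode("append")
           .save(events_path))
    except Exception as exc:  # validation failures surface from the JVM side
        error_options = dict(hudi_options)
        error_options["hoodie.table.name"] = "events_errors"    # illustrative name
        error_options.pop("hoodie.precommit.validators", None)  # don't validate the error table
        (df.withColumn("error_reason", F.lit(str(exc)[:500]))   # keep failure context
           .withColumn("failed_at", F.current_timestamp())
           .write.format("hudi")
           .options(**error_options)
           .mode("append")
           .save(error_path))
```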
Benefits of Using Apache Hudi's Pre-Commit Validator and Error Tables:
Incorporating Apache Hudi's pre-commit validator and error tables brings several key advantages to data processing pipelines:
- Data integrity: invalid or non-compliant records never reach the main dataset.
- Simplified troubleshooting: failed records are isolated in one place, together with the context needed to understand why they failed.
- Auditability and recovery: error tables can be queried incrementally and, once the data is corrected, the records can be backfilled into the main table.
Lab Exercise: Step-by-Step Implementation:
Let's say our data has a column called message, and we want to enforce a rule: whenever message is null, the batch must not be committed to the Hudi table and should instead be diverted to an error table.
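With the single-result validator, that rule can be written as a SQL query that must return 0. This is a sketch; the configuration keys follow the Hudi docs, so verify them against your version:

```python
# The commit is finalized only if no row in the table has a null message.
# <TABLE_NAME> is a literal placeholder that Hudi substitutes during validation.
null_message_rule = {
    "hoodie.precommit.validators":
        "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator",
    "hoodie.precommit.validators.single.value.sql.queries":
        "select count(*) from <TABLE_NAME> where message is null#0",
}
```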
Download Jar
This data will be upserted into the Hudi events table (the full implementation appears in the CODE section below).
Now let's try a batch where message is null.
Because the validation rule checks whether message is null, these records should be diverted to the error Hudi table instead of being committed.
Results
Error Tables
Events table
CODE
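Below is a minimal end-to-end PySpark sketch of the exercise. The paths, table names, and columns (event_id, ts, message) are illustrative assumptions, and the validator configuration keys should be verified against your Hudi version:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hudi-error-table-demo")
         # Assumes the Hudi Spark bundle JAR is on the classpath.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

EVENTS_PATH = "file:///tmp/hudi/events"
ERROR_PATH = "file:///tmp/hudi/events_errors"

common = {
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Non-partitioned tables keep the demo simple.
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

events_options = dict(common, **{
    "hoodie.table.name": "events",
    # Reject any commit that would leave null messages in the table.
    "hoodie.precommit.validators":
        "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator",
    "hoodie.precommit.validators.single.value.sql.queries":
        "select count(*) from <TABLE_NAME> where message is null#0",
})

error_options = dict(common, **{"hoodie.table.name": "events_errors"})

def write_batch(df):
    """Upsert a batch into the events table, diverting it to the error table on failure."""
    try:
        df.write.format("hudi").options(**events_options).mode("append").save(EVENTS_PATH)
    except Exception as exc:  # validation failures surface from the JVM side
        (df.withColumn("error_reason", F.lit(str(exc)[:500]))
           .withColumn("failed_at", F.current_timestamp())
           .write.format("hudi").options(**error_options).mode("append").save(ERROR_PATH))

schema = "event_id string, ts long, message string"
good_batch = spark.createDataFrame(
    [("e1", 1, "order created"), ("e2", 2, "order shipped")], schema)
bad_batch = spark.createDataFrame([("e3", 3, None)], schema)

write_batch(good_batch)  # passes validation and is committed
write_batch(bad_batch)   # fails validation and lands in the error table

spark.read.format("hudi").load(ERROR_PATH).show(truncate=False)
```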
NOTE
Error Hudi tables provide you with enhanced visibility into which items have failed, when they failed, and aid in comprehending the reasons behind their failure. Moreover, you can subsequently iterate over the error tables incrementally, and if necessary, backfill your tables. This enables you to gain deeper insights into the errors, troubleshoot effectively, and make necessary improvements to your processes.
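For example, new error records can be pulled incrementally with Hudi's standard incremental query options (a sketch, reusing the error-table path from the exercise):

```python
# Pull only the error records added since the last instant we processed.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    # "0" reads from the beginning; in practice, checkpoint the last instant seen.
    "hoodie.datasource.read.begin.instanttime": "0",
}
new_errors = (spark.read.format("hudi")
              .options(**incremental_options)
              .load("file:///tmp/hudi/events_errors"))
new_errors.show(truncate=False)
```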
Conclusion
Incorporating Apache Hudi's pre-commit validator and error tables into your data processing pipelines is a strategic move towards ensuring data integrity and improving data quality. By handling failed inserts or upserts effectively, data teams can maintain a high standard of data accuracy and reliability, making Apache Hudi an invaluable asset in the modern data landscape. Empower your organization with these best practices to drive successful data-driven initiatives with confidence.