Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA · 9,968 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. Apache Hudi provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open source table format is not enough; Apache Hudi is also a comprehensive platform of open services and tools needed to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from all around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing

Locations

Employees at Apache Hudi

Updates

  • Apache Hudi reposted

    View Shashank Mishra's profile

    Data Engineer @ Prophecy | Building GrowDataSkills | YouTuber (176k+ Subs) | Teaching Data Engineering | Public Speaker | Ex-Expedia, Amazon, McKinsey, PayTm

    Did you know Apache Hudi started at Uber to handle their large-scale data freshness needs, processing over 500M events per day? Apache Hudi Streamer is a game-changer for building near-real-time pipelines with minimal latency and transactional consistency.
    1. What is Hudi Streamer? Hudi Streamer is a utility designed to ingest streaming data into Hudi tables seamlessly, enabling upserts, incremental pulls, and time-travel queries. It is the backbone for real-time use cases, empowering data engineers to transform batch ETL workflows into fast streaming pipelines.
    2. Key features that set Hudi Streamer apart:
    - Streaming data ingestion: supports ingestion from sources like Kafka, Kinesis, or Event Hubs, with built-in capabilities for managing CDC (Change Data Capture) data.
    - Upserts at scale: unlike traditional streaming frameworks, Hudi Streamer ensures that only changed data is updated, reducing I/O overhead and improving performance.
    - Schema evolution: enables seamless changes in schema, keeping your pipelines robust as your data evolves. No more breaking pipelines!
    - Incremental data processing: pull only the data that changed, giving you real-time analytics without full table scans.
    - Time-travel queries: debugging or compliance? Access historical versions of data with ease, making audits and rollback operations a breeze.
    - Seamless integration: works out of the box with Spark Structured Streaming, making it an excellent fit for modern lakehouse architectures alongside formats like Iceberg or Delta Lake.
    3. How does it work? Hudi Streamer (also known as Hudi DeltaStreamer) is a powerful tool that can:
    - Read from streams (Kafka, DFS, etc.).
    - Apply transformations using Spark SQL or custom logic.
    - Write data into Hudi tables, supporting both COPY_ON_WRITE (optimized for reads) and MERGE_ON_READ (optimized for writes).
    A minimal sketch of this flow follows below.
    Companies embracing real-time analytics, like Uber, LinkedIn, and Netflix, are leveraging tools like Hudi Streamer to power fraud detection, personalized recommendations, and supply chain optimization.
    I have just started the new batch of my "Data Engineering With AWS" BootCAMP, which is high quality, affordable, practical, and built around industry-grade projects. I have included Apache Flink, Hudi & Iceberg too.
    Enroll Here - https://bit.ly/3Y5gCJE
    Dedicated placement assistance & doubt support.
    Call/WhatsApp for any query: (+91) 9893181542
    Cheers - Grow Data Skills
    #dataengineering
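    Hudi Streamer itself is normally launched with spark-submit, but since the post highlights its out-of-the-box fit with Spark Structured Streaming, here is a minimal PySpark sketch of the same Kafka-to-Hudi flow. The topic name, bootstrap servers, schema, field names, and paths are placeholders, not values from the post; treat this as an illustrative Structured Streaming equivalent rather than the tool's exact invocation.

        # Hedged sketch: stream JSON events from Kafka into a Hudi table.
        # Topic name, servers, schema, and paths are illustrative placeholders.
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, from_json
        from pyspark.sql.types import StructType, StructField, StringType

        spark = (SparkSession.builder
                 .appName("kafka-to-hudi-sketch")
                 # Hudi's Spark bundle must be on the classpath (e.g. via --packages).
                 .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                 .getOrCreate())

        schema = StructType([
            StructField("event_id", StringType()),
            StructField("event_ts", StringType()),
            StructField("payload", StringType()),
        ])

        events = (spark.readStream.format("kafka")
                  .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder
                  .option("subscribe", "events-topic")                   # placeholder
                  .load()
                  .select(from_json(col("value").cast("string"), schema).alias("e"))
                  .select("e.*"))

        hudi_options = {
            "hoodie.table.name": "events_hudi",
            "hoodie.datasource.write.recordkey.field": "event_id",
            "hoodie.datasource.write.precombine.field": "event_ts",
            "hoodie.datasource.write.operation": "upsert",
            # COPY_ON_WRITE optimizes reads; MERGE_ON_READ optimizes writes.
            "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        }

        (events.writeStream.format("hudi")
         .options(**hudi_options)
         .option("checkpointLocation", "/tmp/checkpoints/events_hudi")   # placeholder
         .outputMode("append")
         .start("/tmp/lake/events_hudi")                                 # placeholder
         .awaitTermination())

    Swapping the table type to MERGE_ON_READ trades read performance for faster writes, matching the trade-off described in the post; the packaged Hudi Streamer utility, run via spark-submit in continuous mode, achieves a similar pipeline without hand-written Spark code.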

  • Apache Hudi reposted

    View Dipankar Mazumdar, M.Sc's profile

    Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"

    Open Lakehouse Architecture - but what's really open? There's a noticeable shift in how users and customers are now consistently thinking about data architectures. You hear terms like 'open lakehouse'. But what does "open" really mean? This is not easy to define. Still, there seems to be general agreement on one core idea: data should reside as an 'open and independent' tier, allowing all compatible compute engines to operate on that "single copy" based on workloads. The most important thing to understand here is that just replacing proprietary storage formats with 'open table formats' doesn't automatically make everything open and interoperable. In reality, customers end up choosing a particular open table format (based on vendor support) while staying tied to proprietary services and tools for things like optimization and maintenance, among others. This confusion is created by the growing use of jargon like "open data lakehouse" and "open table formats". And no, this is not about build vs. buy! You can still buy vendor solutions while maintaining an open and interoperable platform. The key is that when new workloads arise, you should be able to integrate other tools or seamlessly switch between compute platforms. I set out to answer some of the questions that have been on my mind in this blog (in comments), from the perspective of having worked with the 3 table formats (Apache Hudi, Apache Iceberg & Delta Lake) for the past couple of years of my career. Some questions that I ask:
    - What are the differences between an open table format and an open data lakehouse platform?
    - Is an open table format enough to realize a truly open data architecture?
    - How seamlessly can we move across different platforms today?
    Would love to hear any thoughts! #dataengineering #softwareengineering

  • Apache Hudi reposted

    View Soumil S.'s profile

    Sr. Software Engineer | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue | Data Lake Specialist | YouTuber

    New Blog Post Alert! Learn how to easily run Spark Streaming Hudi jobs on the latest EMR 7.5.0 with a step-by-step guide! From creating an EMR Serverless cluster to submitting your first Hudi streaming job, this post covers everything you need to get started with real-time data processing in the cloud. Whether you're new to Spark or Hudi, this guide will help you set up your streaming pipeline efficiently and effectively. Check it out and start processing data like a pro! Read the full blog (a hedged job-submission sketch follows after the link below). #EMR #SparkStreaming #Hudi #AWS #BigData #RealTimeProcessing #DataEngineering #CloudComputing #TechBlog Apache Hudi

    Learn How to Run Spark Streaming Hudi Jobs on New EMR Serverless 7.5.0

    Soumil S., posted on LinkedIn
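    The blog walks through the EMR Serverless specifics; as a rough companion, here is a hedged boto3 sketch of submitting a PySpark job that uses Hudi to an existing EMR Serverless application. The application ID, role ARN, S3 paths, and Hudi bundle version are placeholders, and the exact spark-submit parameters may differ from what the blog uses.

        # Hedged sketch: submit a PySpark + Hudi job to an existing EMR Serverless application.
        # All identifiers (application ID, role ARN, S3 paths, package version) are placeholders.
        import boto3

        emr = boto3.client("emr-serverless", region_name="us-east-1")

        response = emr.start_job_run(
            applicationId="00example1234567",                                            # placeholder
            executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",   # placeholder
            jobDriver={
                "sparkSubmit": {
                    "entryPoint": "s3://my-bucket/jobs/hudi_streaming_job.py",           # placeholder
                    "entryPointArguments": ["--target-path", "s3://my-bucket/lake/events_hudi"],
                    # Version is illustrative; EMR releases that bundle Hudi may not need --packages.
                    "sparkSubmitParameters": (
                        "--packages org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0 "
                        "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
                    ),
                }
            },
            configurationOverrides={
                "monitoringConfiguration": {
                    "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/emr-logs/"}  # placeholder
                }
            },
        )
        print("Started job run:", response["jobRunId"])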

  • View Apache Hudi's company page

    9,968 followers

    Soumil joins us for Episode 3 of 'Lakehouse Chronicles with Hudi'! If you know Soumil, he always strives to make implementing things simple and fun. In this episode, he will go over a demo that tackles a real-world problem (CDC): bringing data from operational sources into a lakehouse using Hudi Streamer. Specifically, the demo will cover:
    - capturing changes from Postgres using the Debezium Postgres connector
    - publishing them to Kafka topics
    - using Hudi Streamer in continuous mode to read from #Kafka
    - ingesting into the data lakehouse (Hudi)
    - syncing to HMS and querying using Trino (a hedged query sketch follows below)
    Join here: https://lnkd.in/d5PGzZqC #dataengineering #softwareengineering
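    As a rough illustration of the last step (querying the HMS-synced Hudi table from Trino), here is a hedged sketch using the trino Python client. The host, catalog, schema, table, and column names are assumptions, not values taken from the demo.

        # Hedged sketch: query a Hive-Metastore-synced Hudi table through Trino.
        # Connection details and table/column names are illustrative placeholders.
        import trino

        conn = trino.dbapi.connect(
            host="localhost",        # placeholder Trino coordinator
            port=8080,
            user="analyst",
            catalog="hive",          # catalog backed by the Hive Metastore the table was synced to
            schema="lakehouse",      # placeholder schema
        )

        cur = conn.cursor()
        # Hudi tables expose metadata columns such as _hoodie_commit_time alongside data columns.
        cur.execute("""
            SELECT _hoodie_commit_time, id, name, updated_at
            FROM customers_hudi
            ORDER BY _hoodie_commit_time DESC
            LIMIT 10
        """)
        for row in cur.fetchall():
            print(row)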

  • Apache Hudi reposted

    View Dipankar Mazumdar, M.Sc's profile

    Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"

    Concurrency Control Methods in the Lakehouse. Open table formats such as Apache Hudi, Apache Iceberg & Delta Lake support concurrent access to data by multiple transactions. This is one of the most important problems a lakehouse architecture tackles compared to plain data lakes. To ensure transactional guarantees during concurrent reads and writes, there needs to be some mechanism to coordinate them. Concurrency control is like the "traffic police" that defines how different writers and readers coordinate access to a table.
    One of the commonly used methods to handle multiple-writer scenarios is Optimistic Concurrency Control (OCC). With OCC, transactions operate without initial locks, updating data freely. At commit time, they acquire a lock and check for conflicts; if a conflict is found, the entire operation is aborted and has to be retried.
    Is there a problem with that?
    - For one, you would have wasted a lot of computing resources, and hence incurred high costs.
    - OCC was designed for immutable or append-only data and is not great for updates/deletes.
    - OCC assumes that conflicts rarely happen. This design principle doesn't suit high-contention environments (where you have multiple long-running jobs).
    In a lakehouse architecture you are very much expected to face such scenarios, with multiple jobs working on the same table. Apache Hudi handles these concurrency problems intelligently by distinctly separating its processes into 3 categories:
    - writer processes (handling user upserts & deletes)
    - table services (managing data and metadata for optimization & bookkeeping)
    - readers (executing queries)
    It ensures snapshot isolation among these categories, so each operates on a consistent table snapshot. Hudi uses:
    - optimistic concurrency control (OCC) among writers
    - a lock-free, non-blocking approach using MVCC between writers and table services, as well as among the table services themselves
    One very important benefit of using MVCC is that you can continue to ingest new data into the table while a table service like 'clustering' runs in the background to optimize storage. For larger production deployments, this model can ease the operational burden significantly by keeping table services running without blocking writers. On top of these strategies, Hudi has also introduced a generic "non-blocking concurrency control" method, which allows multiple writers to write simultaneously, with conflicts resolved later at query time or via compaction. These robust strategies from the database world bring a great set of controls for dealing with different types of workloads in a #lakehouse. Detailed reading in comments. A hedged multi-writer configuration sketch follows below. #dataengineering #softwareengineering
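    To make the multi-writer case concrete, here is a hedged PySpark write sketch that turns on Hudi's optimistic concurrency control with a lock provider. The table path, field names, and the in-process lock provider are assumptions suited only to a single-process demo; production setups typically use an external lock provider (e.g. ZooKeeper or DynamoDB based).

        # Hedged sketch: a Hudi upsert with optimistic concurrency control enabled.
        # Paths, field names, and the lock provider choice are illustrative placeholders.
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("hudi-occ-sketch")
                 .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                 .getOrCreate())

        updates = spark.createDataFrame(
            [("o-1001", "2024-11-01 10:00:00", "shipped")],
            ["order_id", "updated_at", "status"],
        )

        occ_options = {
            "hoodie.table.name": "orders_hudi",
            "hoodie.datasource.write.recordkey.field": "order_id",
            "hoodie.datasource.write.precombine.field": "updated_at",
            "hoodie.datasource.write.operation": "upsert",
            # Multi-writer settings: optimistic concurrency + lazy cleaning of failed writes.
            "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
            "hoodie.cleaner.policy.failed.writes": "LAZY",
            # In-process lock provider is only suitable for single-process demos.
            "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
        }

        (updates.write.format("hudi")
         .options(**occ_options)
         .mode("append")
         .save("/tmp/lake/orders_hudi"))     # placeholder path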

  • Apache Hudi reposted

    View Sameer Shaik's profile

    Engineer @Bajaj Markets || Data engineer

    Building a Local Data Lakehouse with Apache Hudi, AWS Glue, and Streaming Reads. In today's data-driven world, setting up a robust data lakehouse is essential for efficient data management and processing. Apache Hudi, when combined with AWS Glue, offers a scalable solution for incremental data processing, real-time analytics, and storage optimization. In this guide, I'll walk you through setting up a local data lakehouse environment using Docker, with a focus on Hudi's streaming read capabilities. By the end, you'll be able to monitor data changes in real time, making it perfect for testing and rapid development in a local setup (a hedged streaming-read sketch follows after the link below). #DataEngineering #DataLakehouse #ApacheHudi #AWSGlue #StreamingData #RealTimeAnalytics #BigData #Docker #DataProcessing #ETL

    Real-time Data Processing with PySpark and Apache Hudi’s Incremental Query Feature

    Sameer Shaik, posted on LinkedIn
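    As a rough companion to the post's focus on streaming reads, here is a hedged PySpark sketch that tails a Hudi table with Spark Structured Streaming and prints newly committed records to the console. The table path, checkpoint location, and trigger interval are placeholders, not values from the guide.

        # Hedged sketch: continuously read new commits from a Hudi table (streaming read).
        # The table path, checkpoint location, and trigger interval are illustrative placeholders.
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("hudi-streaming-read-sketch")
                 .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                 .getOrCreate())

        # Each micro-batch surfaces records from commits that arrived since the last batch.
        changes = (spark.readStream
                   .format("hudi")
                   .load("/tmp/lake/events_hudi"))          # placeholder table path

        (changes.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/events_hudi_reader")  # placeholder
         .trigger(processingTime="30 seconds")
         .start()
         .awaitTermination())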

  • Apache Hudi reposted

    View Dipankar Mazumdar, M.Sc's profile

    Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"

    Incremental Processing in the Lakehouse. Incremental processing is an approach where only small, recent changes to data are processed, rather than processing large batches of data all at once. This method is particularly valuable in environments where data is continuously updated, enabling frequent and manageable updates. For example, take Uber: their 'Trips' database is crucial for providing accurate trip-related data. Previously, Uber relied on bulk uploads:
    - writing ~120 TB to Parquet every 8 hours
    - the actual data changes would be less than 500 GB, yet they'd still perform a full recomputation of downstream tables
    - this resulted in data freshness delays of up to ~24 hours
    This is where incremental processing powered by Apache Hudi comes in. Here are some benefits of processing data incrementally:
    - Efficiency gains: processing only recent changes reduces data volume and eases the burden on computational resources.
    - Fresher data: systems can access up-to-date information almost in real time by minimizing delays.
    - Cost savings: incremental updates eliminate the expense of frequent full recomputation.
    - Simplified debugging: errors are limited to smaller increments, making them easier to identify and fix.
    Apache Hudi facilitates incremental processing using its Timeline, alongside features such as indexing, log merging, and incremental queries, making it indispensable for this style of data processing. For instance, in Spark SQL, you can execute an incremental query to process data from the earliest commit to the latest state using the following command:
    SELECT * FROM hudi_table_changes('db.table', 'latest_state', 'earliest');
    A hedged DataFrame equivalent follows below. #dataengineering #softwareengineering
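    For readers using the DataFrame API instead of Spark SQL, here is a hedged sketch of an incremental read. The table path and the begin-instant value are placeholders; in practice the begin time usually comes from the last commit your pipeline has already processed.

        # Hedged sketch: pull only records written after a given commit (incremental query).
        # The table path and begin instant are illustrative placeholders.
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("hudi-incremental-query-sketch")
                 .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                 .getOrCreate())

        incremental = (spark.read.format("hudi")
                       .option("hoodie.datasource.query.type", "incremental")
                       # Commits after this instant are returned; "000" effectively means "from the beginning".
                       .option("hoodie.datasource.read.begin.instanttime", "000")
                       .load("/tmp/lake/trips_hudi"))       # placeholder table path

        # Returned rows carry Hudi metadata columns such as _hoodie_commit_time.
        incremental.show(truncate=False)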

  • View Apache Hudi's company page

    9,968 followers

    Welcome to Episode 3 of 'Lakehouse Chronicles with Apache Hudi'. In this session, we are excited to have Soumil Shah, Sr. Software Engineer, who will present how to tackle a real-world problem (CDC): bringing data from operational sources into a lakehouse using Hudi Streamer. Specifically, the demo will cover:
    - capturing changes from Postgres using the Debezium Postgres connector
    - publishing them to Kafka topics
    - using Hudi Streamer in continuous mode to read from Kafka
    - ingesting into the data lakehouse (Hudi)
    - syncing to HMS and querying using Trino

    Ep 3: From PostgreSQL to Lakehouse using Hudi Streamer, Debezium & Kafka

