Unlocking the Potential of Apache Iceberg
When Apache Iceberg began making waves in the data lake ecosystem, the Upsolver team was quick to recognize its significance and figured this new technology was here to stay. To spread the word and dive deeper into the lakehouse and open table format architecture, we organized and hosted our first Chill Data Summit, held in NYC.
Despite the chilly weather, the atmosphere was buzzing with excitement as we welcomed a stellar lineup of speakers from the data industry. Here is a summary of the talks from the Upsolver team, starting with Santona Tuli, Ph.D., who opened the Chill Data Summit with an introduction to Apache Iceberg and the open table format.
Why Open Table Formats and Why Now
Why now? If we look back on the evolution of how we store and access data, it’s easier to understand the path that now leads us to the lakehouse. For decades, databases provided us with natural storage for our operational data, but they were never designed for running analytics, so the data warehouse was invented.
However, the data warehouse has always been a concept for modeling business data for insights; because the data arrives already modeled, it is not adaptable or well suited to ML use cases.
Data lakes bypass the prescriptive modeling and interface of a data warehouse, making them ideal for ML. Their open format makes them modular and pluggable, unlocking a multitude of further uses, and they are also less costly than a data warehouse.
But this extra control over our data comes at a cost: maintaining a lake demands continuous engineering know-how, and lakes are not as easy to query as a structured database or warehouse. The trade-off for cheap, flexible storage is ease of querying: users must know what data lives where in the lake.
This is where the lakehouse comes in, combining the best of all three technologies. The lakehouse gives us efficiency and performance, compaction, Merge-on-Read for processing frequent changes, and a model of our business that is closer to what we have in our data warehouse.
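Compaction in particular is easy to picture with a concrete command. Iceberg ships table-maintenance procedures for Spark; the PySpark sketch below calls the rewrite_data_files procedure, where the catalog name (lake), table name (db.events), and file-size target are illustrative assumptions, not names from the talk:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# named "lake"; the table name and target size are illustrative.
spark = SparkSession.builder.appName("iceberg-compaction-sketch").getOrCreate()

# rewrite_data_files compacts many small files into fewer large ones
# (a 128 MB target in this sketch), which speeds up subsequent scans.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table   => 'db.events',
        options => map('target-file-size-bytes', '134217728')
    )
""")
```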
So how does the open table format fit in? Whereas a cloud warehouse provider uses a proprietary table format, the open table format is vendor-agnostic and portable, creating an industry standard that can be queried by any engine. This lets you switch out your catalog or query engine without touching the table format.
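To make that portability concrete, the same table can be opened straight from Python with the open-source pyiceberg library, no warehouse engine required. This is a minimal sketch; the catalog URI, table name, and field names are assumptions for illustration:

```python
from pyiceberg.catalog import load_catalog

# Connect to an Iceberg REST catalog; the URI and names below are
# placeholders, not a real endpoint.
catalog = load_catalog(
    "default",
    **{"type": "rest", "uri": "https://example.com/iceberg-catalog"},
)

# Any engine (or plain Python) resolves the same table definition.
table = catalog.load_table("db.events")

# Scan a projection of the table into an in-memory Arrow table.
events = table.scan(selected_fields=("event_id", "event_time")).to_arrow()
print(events.num_rows)
```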
Growing volumes of streaming data demand a better way of storing, managing, and querying our data, and Apache Iceberg arrives just in time to help us build a better data future.
Performance & Ease-of-use at Lake Scale
The Chill Data Summit was the perfect platform for Upsolver’s founders, Ori Rafael and Yoni Eini, to announce support for Apache Iceberg with three big new features.
First, building on Upsolver’s existing big data ingestion platform, we added the ability to ingest high-scale streaming data into Iceberg tables. Using our zero-ETL tool, you can ingest data from database, stream, and file sources into your lakehouse and benefit from automatic compaction and tuning on your tables to keep performance high!
Not only do we handle schema evolution, eliminate bad data, and guarantee strongly ordered, exactly-once delivery, but we also remove the administrative overhead of optimizing your tables.
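Upsolver’s ingestion pipeline itself is proprietary, so as a toy illustration only, here is what a single micro-batch append to an Iceberg table looks like with pyiceberg (the table name and schema are invented, and a real streaming writer would also handle the ordering, deduplication, and schema evolution described above):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder catalog and table; the schema below is invented and must
# match the Iceberg table's schema for the append to succeed.
catalog = load_catalog("default")
table = catalog.load_table("db.clicks")

# A tiny "micro-batch" of events as an Arrow table.
batch = pa.table(
    {
        "user_id": pa.array([101, 102], type=pa.int64()),
        "url": pa.array(["/home", "/pricing"]),
    }
)

# Each append commits a new table snapshot; many small appends like
# this are exactly why continuous compaction and tuning matter.
table.append(batch)
```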
If you’re not using Upsolver to ingest your data but have existing Iceberg tables, we can offer you an easy table optimization solution. With our Iceberg Table Optimizer tool, it’s as simple as connecting to your data catalog and selecting the tables you want to optimize.
Upsolver’s analyzer calculates the potential storage cost savings, along with the faster data scans that will speed up your queries. You can set the optimizer to run continuously in the background, so you don’t have to figure out the best time to run it yourself.
Lastly, we launched an open-source Iceberg Table Analyzer tool that you can download for free. Install the CLI and run it against your Iceberg tables to uncover potential storage savings and faster query performance.
However you’re using Apache Iceberg, check out Upsolver’s solutions to help you get the most out of your Iceberg lakehouse, and watch the recording to learn more about Upsolver.
From 5.5 hours to 39 seconds: Adapting Iceberg for High-Scale Streaming Updates
Jason Fine's experience with streaming data and solving the challenges that come with high-scale data was recently put to the test when he encountered a query that took over five and a half hours and needed a better plan of execution.
Data lakes are not best suited to handling streaming data with frequent updates and deletes: every change forces the creation of a new data file, which adds IO overhead on write and leaves the reader with a lot of small files to open to satisfy a query.
Iceberg offers two approaches for writing changes: Copy-on-Write (CoW) and Merge-on-Read (MoR). The CoW approach rewrites the affected data files as soon as a change lands in the data lake, making it efficient for frequent reads but costly when updates are frequent.
On the other hand, MoR writes changes to separate files and merges them with the data when a query is executed. If you have streaming data, this is more performant, as high-scale changes can be written into a change file at pace. However, issues arise when reading the data and applying the changes, which was at the heart of the problem in Jason’s query.
With MoR, you can choose between position delete files and equality delete files to apply the deletes to the data files. As their names suggest, a position delete removes a row based on its position in the data file, and an equality delete matches rows whose field value equals the delete value.
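In Iceberg, these write behaviors are ordinary table properties. Here is a hedged PySpark sketch that switches a format-version 2 table to Merge-on-Read for deletes, updates, and merges; the catalog and table names are again illustrative:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog named "lake"; table name is illustrative.
spark = SparkSession.builder.appName("mor-settings-sketch").getOrCreate()

# On format-version 2 tables, deletes, updates, and merges can each be
# set to 'copy-on-write' or 'merge-on-read' independently.
spark.sql("""
    ALTER TABLE lake.db.events SET TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```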
Jason’s query used equality delete files, but because each delete file must be applied against every data file read to satisfy the query, opening and reading the delete file alongside every data file caused very high IO and hindered performance.
The solution was to create the Upsolver Iceberg Streaming Data API, which massively reduced the number of manifest files being generated in Iceberg and took the query down from five and a half hours to well under a minute.
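The Streaming Data API itself is proprietary to Upsolver, but Iceberg’s open-source toolbox includes a maintenance procedure in the same spirit, rewrite_manifests, which consolidates manifest metadata so query planning touches far fewer files. A minimal sketch with assumed names:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog named "lake"; table name is illustrative.
spark = SparkSession.builder.appName("manifest-maintenance").getOrCreate()

# rewrite_manifests rewrites and clusters manifest files so that query
# planning has far fewer metadata files to open.
spark.sql("CALL lake.system.rewrite_manifests(table => 'db.events')")
```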
Check out the recording for the full story and gain a deep insight into the inner workings of Iceberg!
The Open Lakehouse Deserves an Open Ecosystem
As we learned from Santona’s talk, the open table format exposes data to any query engine because it is stored in a non-proprietary table format. This is one aspect of the open lakehouse that Jason Hall believes creates opportunities for new use cases on your data.
Jason gave a demo of ingesting high-scale data into Iceberg using Upsolver's zero-ETL solution and showed us how to query those Iceberg tables from different tools, including Dremio, Starburst, and Snowflake.
Adopting Iceberg protects us from vendor lock-in, both for the storage of our data and for how we access it. This is great news for the tools market and for us as consumers, who can choose how to manage and query our data now that it is “opened up” in the lakehouse.
One exciting use for your lakehouse is to query the data as a graph. Using PuppyGraph, Jason demonstrated how you can connect PuppyGraph to the tables in your lakehouse that Upsolver created for you as part of your ingestion job. Because we continuously tune your tables, PuppyGraph will always be working on optimized data, ensuring the fastest possible results.
Watch Jason's talk to see Upsolver and PuppyGraph in action.
If you missed out on the Chill Data Summit, all the recordings are available here, where you will be able to see the accompanying slides and demos.
Have you started an Iceberg project yet, or have one in the planning stage? Why not book an appointment with one of our Solutions Architects so we can help?