Advanced Techniques for Optimizing Apache Iceberg Lakehouse Performance
Apache Iceberg is enjoying rapid adoption in the data lake ecosystem, as engineers discover the benefits of a data warehouse-like experience on their lakes.
In a recent webinar hosted by Upsolver’s Jason Hall, we explored how you can optimize your Iceberg lakehouse to keep performance high and costs low.
Read on to discover Jason’s top tips for keeping your lakehouse running at full speed.
Objectives
1 - Explore the underlying components that make Iceberg work
2 - Discover considerations and techniques to improve Iceberg performance
3 - Learn where Upsolver fits into the Iceberg Lakehouse architecture.
1. What is a Lakehouse?
Fundamentally, a lakehouse extends the concept of a data lake to provide metadata management and a more warehouse-like experience on data.
Think of your data lake: it’s just a bucket of files with no real organization. A lakehouse adds the organization, metadata, and schema management that’s missing, along with access control and transactional support.
Until the rise of the open table formats introduced with the lakehouse architecture, many of the common features you’re no doubt familiar with from the database and warehouse worlds were cumbersome to manage on data lakes.
The components of the lakehouse architecture, which are all managed for us in a database or warehouse, are now modular, enabling us to choose the tooling and options we want at each level of the stack:
What is Iceberg?
Iceberg is an open-source project that is quickly becoming a leading standard for open lakehouse table formats, excelling as a high-performance format for huge analytic tables. Iceberg has many unique capabilities that help engineers and developers manage their lakehouse data.
Data Files
Iceberg supports a variety of file formats, including Parquet, Avro, and ORC. The data files are exactly that - where the data lives - and are at the lowest level in the lakehouse layers:
Source: https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/
Manifest Files
On top of the data layer is the metadata layer, which comprises manifest files that describe the location and contents of each data file, along with some high-level statistics. Most Iceberg tables have multiple manifest files to support parallelism; each manifest file describes a group of data files, and together the manifest files describe the table.
Above the manifest files is the manifest list, which enumerates the manifest files that make up the contents of a table. At the top of the metadata layer is the metadata file, which contains the schema information, the partition spec, and one or more snapshots, along with a pointer to the snapshot that is current.
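You can see these layers for yourself, because Iceberg exposes them as queryable metadata tables. The following is a minimal sketch in Spark SQL, assuming an Iceberg catalog named catalog and a table db.logs (both names are illustrative):

-- List the snapshots recorded in the metadata file
SELECT snapshot_id, committed_at, operation
FROM catalog.db.logs.snapshots;

-- List the manifest files, with the number of data files each one tracks
SELECT path, added_data_files_count, existing_data_files_count
FROM catalog.db.logs.manifests;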
Catalog
At the very top of the stack is the Iceberg catalog, which maps each table to its current metadata file. AWS Glue and Tabular are two catalog implementations that support Iceberg.
Snapshots
Snapshots provide a point-in-time view of the state of an Iceberg table and are used to provide “time travel” to historical points. The metadata file contains the list of table snapshots, enabling you to query a previous state of the data. This feature brings much power to the existing data lake architecture.
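As a quick sketch of time travel in Spark SQL (the table name, timestamp, and snapshot ID below are placeholders; real snapshot IDs come from the snapshots metadata table):

-- Query the table as it looked at a point in time
SELECT * FROM catalog.db.logs TIMESTAMP AS OF '2024-06-01 00:00:00';

-- Or pin the query to a specific snapshot ID
SELECT * FROM catalog.db.logs VERSION AS OF 1234567890123456789;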
2. Considerations that Affect Iceberg Performance
Partitioning
Partitioning is simply the process of grouping rows that are often queried together so they are stored together. If you have a database background, it might help to think of this as indexing for lakehouse tables: data is grouped to make it more efficient to query, avoiding full table scans.
Iceberg Hidden Partitioning
With Iceberg, partitions are defined as table configuration, making them much less explicit than Hive partitions. Whereas Hive partitions must be defined as explicit columns when the table is created and the data is written, data inserted into an Iceberg table is automatically partitioned based on the table configuration.
Furthermore, Iceberg handles the process of producing a partition value, and partition columns do not need to be explicitly referenced in queries that read from the table.
If a table is partitioned by day and you write data with a timestamp column, Iceberg knows where to write the data, unlike Hive, which requires the column values to exactly match the partition.
The following example shows a logs table partitioned on the event time column event_ts using the days() function. In the INSERT statement, the query writes a timestamp value into the event_ts column, but Iceberg is clever enough to accept this value and partition it using the day portion of the timestamp:
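Here is a minimal sketch of what that looks like in Spark SQL; the catalog and database names, and the columns other than event_ts, are illustrative:

-- Partition on the day portion of event_ts; no explicit partition column needed
CREATE TABLE catalog.db.logs (
  log_id    BIGINT,
  level     STRING,
  message   STRING,
  event_ts  TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Writers just supply the timestamp; Iceberg derives the partition value
INSERT INTO catalog.db.logs
VALUES (1, 'INFO', 'login succeeded', TIMESTAMP '2024-06-01 10:22:07');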
Because Iceberg handles the mapping to the partition key, partitioning has a bigger impact on query performance, removing much of the burden of partitioning from the writer and the reader and placing it on the table configuration.
Iceberg enables you to modify the partition spec at any time, simply by using an ALTER statement, because the partition configuration is maintained in the metadata files. However, the change only applies to new snapshots created for that table; you would need to rewrite existing data to move it into the new partition layout.
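For example, switching the illustrative logs table from daily to hourly partitions might look like this in Spark SQL, assuming the Iceberg SQL extensions are enabled (and remembering that only data written after the change uses the new spec):

-- Evolve the partition spec in place; existing files keep the old layout
ALTER TABLE catalog.db.logs
REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);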
Sorting
In general, performance is optimized when rows queried together are “close” to each other within a data file. Partitioning is one way to ensure this: if most of your queries filter on a date and you have partitioned by date, you don’t need to scan the entire table to find the matching results. Partitioning is the first step to ensuring that data queried together is stored together, and sorting is the next.
Column-level statistics are stored within the metadata, and you can define an Iceberg table that sorts by several columns. When the data is sorted, the statistics in the Iceberg metadata help the query engine determine which files to scan. The metadata files include min and max values for each column, so the query engine knows which files contain the matching data, as we can see in the following diagram:
So if you regularly filter on a value other than the partition column, it’s important to add sorting to the table. Sorting is well suited to queries that filter on a single column; for multiple dimensions, consider Z-ordering. Check out this blog from our friends over at Dremio to learn how Z-ordering works.
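As a sketch against the same illustrative logs table (and again assuming the Iceberg SQL extensions), a sort order can be declared on the table so new files are written sorted, and existing files can be rewritten with multi-dimensional Z-order clustering:

-- New data files will be sorted on write
ALTER TABLE catalog.db.logs WRITE ORDERED BY level, event_ts;

-- Re-cluster existing files with Z-ordering across two dimensions
CALL catalog.system.rewrite_data_files(
  table => 'db.logs',
  strategy => 'sort',
  sort_order => 'zorder(level, event_ts)'
);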
Small Files
Each write to an Iceberg table creates a new snapshot (version) of the table, so if your source data is written once a day, this would generate one new snapshot a day.
When it comes to streaming, this isn’t so simple. If your stream writes data every minute, it creates a new snapshot every minute. Each snapshot adds a metadata file that points to the manifest and data files comprising the new snapshot, so you end up generating a lot of snapshots, manifest files, and data files. And while these files are likely to be small, there are lots of them.
Furthermore, if the Iceberg table has a sort order and data arrives every minute, only the data within each minute’s files will be sorted, reducing the effectiveness of the sort.
Querying can then become expensive in terms of performance and storage. If you write a new snapshot every minute, over the course of two days you will generate 2,880 data files. To query the last two days of data, you would need to open all 2,880 files, plus the metadata and manifest files, read the data out, and then close each file. That is a lot of overhead just to open and close files. On top of that, you need to factor in merge-on-read delete files, which must be applied across all the files you open to produce consistent results.
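You can get a feel for the scale of the problem by querying the files metadata table; this sketch again assumes a table named db.logs:

-- How many data files does the table have, and how big are they on average?
SELECT count(*) AS data_file_count,
       round(avg(file_size_in_bytes) / 1024 / 1024, 1) AS avg_size_mb
FROM catalog.db.logs.files;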
Regular maintenance drives down these costs and is especially important when performing frequent writes. The Iceberg documentation recommends you always perform the following three maintenance tasks:
Expire snapshots. When a snapshot is expired, it is removed from the metadata and is no longer available for time travel. It is unlikely that you will need to keep snapshots forever, so determine how long the business needs them and expire the rest.
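In Spark SQL this is the expire_snapshots procedure; the table name, cutoff timestamp, and retention count below are illustrative:

CALL catalog.system.expire_snapshots(
  table => 'db.logs',
  older_than => TIMESTAMP '2024-06-01 00:00:00',  -- drop snapshots older than this
  retain_last => 10                               -- but always keep the last 10
);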
Remove old metadata files. There are two table configuration settings you can use to clean up old metadata files automatically, enabling Iceberg to reduce the number of files and free up storage space.
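The Iceberg table properties that control this are write.metadata.delete-after-commit.enabled and write.metadata.previous-versions-max; a sketch of setting them (the retention count is illustrative):

ALTER TABLE catalog.db.logs SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',  -- clean up on each commit
  'write.metadata.previous-versions-max' = '25'           -- keep at most 25 old metadata files
);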
Delete orphan files. These are data files that are no longer referenced by any metadata file. Various scenarios can result in orphaned files, such as a failed job, so it’s good practice to clean them up once in a while.
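In Spark SQL this is the remove_orphan_files procedure; the cutoff timestamp is illustrative, and files newer than it are left alone in case a write is still in flight:

CALL catalog.system.remove_orphan_files(
  table => 'db.logs',
  older_than => TIMESTAMP '2024-06-01 00:00:00'
);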
Additional Maintenance
Compaction is very much a recommended maintenance operation for streaming use cases, as it combines small files into larger files, speeding up data reads and, in turn, queries. When compacting lots of small files into fewer, larger files, you will also see improvements in sorting.
You can configure the compaction operation using the rewriteDataFiles procedure to define how far back to include files for compaction and also set a target data file size.
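In Spark SQL the equivalent procedure is rewrite_data_files; the sketch below compacts only recent files and targets roughly 512 MB output files (the predicate, table name, and target size are illustrative):

CALL catalog.system.rewrite_data_files(
  table => 'db.logs',
  where => 'event_ts >= TIMESTAMP ''2024-06-01 00:00:00''',  -- how far back to compact
  options => map('target-file-size-bytes', '536870912')      -- ~512 MB target files
);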
When streaming data into Iceberg, compaction needs to run frequently. Be aware, however, that every compaction run consumes compute, so you will need to weigh the resource cost of compaction against the improved query performance.
Sharing Iceberg Tables Externally
Iceberg has gained widespread adoption by delivering on its promise of a “shared storage” ecosystem, allowing a variety of warehouses and query engines to access a single source of data.
We can have a single Iceberg table that feeds dozens of use cases, eliminating vendor lock-in as data is held in Iceberg’s open table format and not a proprietary format.
The diagram below demonstrates that, whoever is managing your Iceberg tables, they can be queried by any compatible engine, whether Snowflake, Databricks, Athena, Redshift, Trino, Dremio, Presto… and so the list goes on:
Many of these tools are strong for a specific use case, such as Snowflake for analytics, and you may want to keep using them after creating your lakehouse. Rather than having to write the data into a proprietary format for Snowflake users to consume, the data can be queried directly in Iceberg from Snowflake.
Currently, industry support for Iceberg is inconsistent, and some vendors have varying support for:
At best, these limitations can make performance unreliable; at worst, they can make the integration non-functional.
Lessons Learned
Ensure you test Iceberg within the ecosystem in which you plan to use it. After you create your Iceberg tables, ingest data into them and query them to see how they perform. Test, test, test...
Join and contribute to the very active community, both around Iceberg itself and around third-party vendors. The community is iterating quickly, so be an active part of it. Some of the limitations you experience within a tool may be addressed very soon, so keep testing to ensure everything works!
Upsolver
At Upsolver, we believe we have the most complete shared storage platform based on Iceberg:
We have a management and optimization tool for your existing Iceberg lakehouse, regardless of the ETL tool you use. The tool applies best-practice performance techniques to make the table as efficient as possible. Many of the manual administration tasks mentioned above are built into this tool, so you don’t have to worry about applying them.
The first tool we offer is the Iceberg Table Analyzer, which is open source and free of charge. The analyzer runs a health check on your Iceberg tables and generates a report showing the potential performance gains to be made by compacting and tuning the table:
If you find a table that needs optimization, we have another tool that can help. Our Iceberg Table Optimizer continuously and automatically applies Iceberg engineering best practices, by doing the following:
The Iceberg Table Optimizer integrates both a health check of the table and optimization. Simply select the tables you want Upsolver to manage, and we take care of the rest.
Takeaways
If you have any questions for Jason, please leave a comment or reach out to him directly. Here are a few links to get you started.
Finally, a big shout out to our friends at Dremio and Tabular (now part of Databricks) for the use of their graphics!