10 Future Apache Iceberg Developments to Look Forward to in 2025
Alex Merced
Co-Author of “Apache Iceberg: The Definitive Guide” | Senior Tech Evangelist at Dremio | LinkedIn Learning Instructor | Tech Content Creator
Apache Iceberg remains at the forefront of innovation, redefining how we think about data lakehouse architectures. In 2025, the Iceberg ecosystem is poised for significant advancements to empower organizations to handle data more efficiently, securely, and at scale. The year ahead promises to be transformative, from enhanced interoperability with modern data tools to new features that simplify data management. In this blog, we’ll explore 10 exciting developments in the Apache Iceberg ecosystem that you should watch, offering a glimpse into the future of open data lakehouse technology.
1. Scan Planning Endpoint in the Iceberg REST Catalog Specification
One of the most anticipated updates in the Iceberg ecosystem for 2025 is the addition of a “Scan Planning” endpoint to the Iceberg REST Catalog specification. This enhancement will allow query engines to delegate scan planning — the process of reading metadata to determine which files are needed for a query — to the catalog itself. This new capability opens the door to several exciting possibilities:
Introducing this endpoint is a step toward improving query performance and a glimpse into a future where catalogs become the central hub for table format compatibility. A similar endpoint for handling metadata writes may be introduced to fully realize this vision, further extending the catalog’s capabilities.
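To make the idea concrete, here is a minimal Python sketch of what scan planning does: given per-file column statistics, keep only the files whose value range can match the query filter. The `DataFile` type, the `plan_scan` function, and the file paths are hypothetical illustrations, not part of the REST Catalog specification. A catalog-side endpoint would presumably run comparable logic on the server and hand the resulting file list back to the engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a manifest entry with min/max stats for one column."""
    path: str
    lower: int  # minimum value of the filtered column in this file
    upper: int  # maximum value of the filtered column in this file


def plan_scan(files: list[DataFile], lo: int, hi: int) -> list[str]:
    """Return paths of files whose [lower, upper] range overlaps the filter [lo, hi]."""
    return [f.path for f in files if f.upper >= lo and f.lower <= hi]


# Made-up file listing for illustration only.
manifest = [
    DataFile("data/a.parquet", 1, 100),
    DataFile("data/b.parquet", 101, 200),
    DataFile("data/c.parquet", 201, 300),
]

# Filter: value BETWEEN 150 AND 220, so only b and c can hold matching rows.
planned = plan_scan(manifest, 150, 220)
print(planned)
```

The engine then reads only the planned files instead of walking the table metadata itself.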
2. Interoperable Views in Apache Iceberg
Interoperable views are another major development to watch in the Apache Iceberg ecosystem for 2025. While Iceberg already supports a view specification, the current approach has limitations: it stores the SQL used to define the view, but since SQL syntax varies across engines, resolving these views is not always feasible in a multi-engine environment.
To address this challenge, two promising solutions are being explored:
These advancements aim to make views in Iceberg truly interoperable, allowing seamless sharing and resolution across different engines and workflows. Whether through SQL transpilation or an intermediate format, these improvements will significantly enhance Iceberg’s flexibility in heterogeneous data environments.
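The limitation is easy to see in a small Python sketch. The view specification stores SQL representations tagged with a dialect, so an engine can only resolve the view if a representation in its own dialect exists. The view contents, table name, and the `resolve_view` helper below are hypothetical; transpilation or an intermediate format would be ways to cover the failing case.

```python
# Hypothetical view metadata: one SQL representation per dialect.
view_representations = [
    {
        "type": "sql",
        "dialect": "spark",
        "sql": "SELECT id, date_format(ts, 'yyyy-MM') AS month FROM events",
    },
    {
        "type": "sql",
        "dialect": "trino",
        "sql": "SELECT id, format_datetime(ts, 'yyyy-MM') AS month FROM events",
    },
]


def resolve_view(representations: list[dict], engine_dialect: str) -> str:
    """Return the SQL text written for the given engine dialect, if any."""
    for rep in representations:
        if rep["type"] == "sql" and rep["dialect"] == engine_dialect:
            return rep["sql"]
    # No matching dialect: this is the gap interoperable views aim to close.
    raise LookupError(f"no SQL representation for dialect {engine_dialect!r}")


print(resolve_view(view_representations, "trino"))
```

An engine whose dialect is not listed hits the `LookupError` branch, which is exactly the multi-engine problem described above.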
3. Materialized Views in Apache Iceberg
A materialized view stores a query definition as a logical table, with precomputed data that serves query results. By shifting the computational cost to precomputation, materialized views significantly improve query performance while maintaining flexibility. The Iceberg community is working towards a common metadata format for materialized views, enabling their creation, reading, and updating across different engines.
Key Features of Iceberg Materialized Views
Storage Table State Management:
Refresh Mechanisms: Materialized views can be refreshed through various methods, including event-driven triggers, query-time checks, scheduled refreshes, or manual operations. These methods ensure the precomputed data remains relevant to the underlying data.
Query Optimization: Queries can use precomputed data directly if it meets freshness criteria (e.g., the materialization.data.max-staleness property). Otherwise, the query engine determines the next steps, such as refreshing the data or falling back to the original view definition.
Interoperability and Governance: The shared metadata format supports lineage tracking and consistent states, making materialized views easy to manage and audit across engines.
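The query-time freshness check can be sketched in a few lines of Python. This is only an illustration of the decision described above, assuming the staleness limit has already been read from a property such as `materialization.data.max-staleness`; the function names and return strings are hypothetical and not taken from the proposal.

```python
from datetime import datetime, timedelta, timezone


def can_use_materialization(
    last_refresh: datetime, now: datetime, max_staleness: timedelta
) -> bool:
    """True when the precomputed data is still within the allowed staleness."""
    return now - last_refresh <= max_staleness


def choose_plan(last_refresh: datetime, now: datetime, max_staleness: timedelta) -> str:
    """Pick between reading precomputed data and the engine's fallback path."""
    if can_use_materialization(last_refresh, now, max_staleness):
        return "read storage table"
    # What happens here is left to the engine: refresh, or run the view query.
    return "refresh or fall back to view definition"


last_refresh = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
max_staleness = timedelta(hours=1)

fresh_plan = choose_plan(last_refresh, last_refresh + timedelta(minutes=30), max_staleness)
stale_plan = choose_plan(last_refresh, last_refresh + timedelta(hours=2), max_staleness)
print(fresh_plan, "|", stale_plan)
```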
Impact on the Iceberg Ecosystem
Materialized views in Iceberg offer a way to optimize query performance while ensuring that optimizations are portable across systems. Iceberg hopes to enable organizations to harness the benefits of materialized views without being locked into specific query engines by providing a standard for metadata and refresh mechanisms. This development will make Iceberg an even more compelling choice for building scalable, engine-agnostic data lakehouses.
4. Variant Data Format in Apache Iceberg
The upcoming introduction of the variant data format in Apache Iceberg marks a significant advancement in handling semi-structured data. While JSON can already be stored in Iceberg tables as plain strings, the variant data type offers a more efficient and versatile approach to managing JSON-like data, aligning with the Spark variant format.
How Variant Differs from JSON
The variant data format is designed to provide a structured representation of semi-structured data, improving performance and usability:
Benefits of the Variant Format
5. Native Geospatial Data Type Support in Apache Iceberg
The integration of geospatial data types into Apache Iceberg is poised to open up powerful capabilities for organizations managing location-based data. While geospatial data has long been supported by big data tools like GeoParquet, Apache Sedona, and GeoMesa, Iceberg’s position as a central table format makes the addition of native geospatial support a natural evolution. Leveraging prior efforts such as Geolake and Havasu, this proposal aims to bring geospatial functionality into Iceberg without the need for project forks.
Proposed Features
The geospatial extension for Iceberg will introduce:
Key Use Cases
Table Creation with Geospatial Types:
CREATE TABLE geom_table (geom GEOMETRY);
Inserting Geospatial Data:
INSERT INTO geom_table VALUES ('POINT(1 2)'), ('LINESTRING(1 2, 3 4)');
Querying with Geospatial Predicates:
SELECT * FROM geom_table WHERE ST_COVERS(geom, ST_POINT(0.5, 0.5));
Geospatial Partitioning:
ALTER TABLE geom_table ADD PARTITION FIELD (xz2(geom));
CALL rewrite_data_files(table => 'geom_table', sort_order => 'hilbert(geom)');
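To show why a Hilbert sort order helps, here is a small Python sketch of the classic Hilbert curve index on a square grid. It is not Iceberg's implementation, and `hilbert_index` is a hypothetical helper name. The point is only that sorting by this index keeps spatially close cells next to each other, so rows for nearby locations tend to land in the same data files.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Map cell (x, y) on an n x n grid (n a power of two) to its Hilbert curve position."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Made-up grid cells standing in for point geometries.
points = [(3, 0), (0, 0), (1, 1), (2, 2), (0, 1)]

# Sorting by Hilbert index clusters neighbouring cells together.
clustered = sorted(points, key=lambda p: hilbert_index(4, *p))
print(clustered)
```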
Benefits
6. Apache Polaris Federated Catalogs
Apache Polaris is expanding its capabilities with federated catalogs, allowing seamless connectivity to external catalogs such as Nessie, Gravitino, and Unity. This feature makes the tables in these external catalogs visible and queryable from a Polaris connection, streamlining Iceberg data federation within a single interface.
Current State
Polaris currently supports read-only external catalogs, enabling users to query and analyze data from connected catalogs without duplicating data or moving it between systems. This functionality simplifies data integration and allows users to leverage the strengths of multiple catalogs from a centralized Polaris environment.
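Conceptually, read-only federation behaves like a lookup layer in front of several catalogs. The Python sketch below is a toy model under that assumption; the class, its methods, the catalog name, and the metadata path are all hypothetical and do not reflect the Polaris API.

```python
class ReadOnlyFederatedCatalog:
    """Toy model: route table lookups to registered external catalogs, reject writes."""

    def __init__(self) -> None:
        self._externals: dict[str, dict[str, str]] = {}

    def register(self, name: str, tables: dict[str, str]) -> None:
        """Attach an external catalog as a mapping of table name to metadata location."""
        self._externals[name] = dict(tables)

    def load_table(self, identifier: str) -> str:
        """Resolve 'catalog.namespace.table' to the table's metadata location."""
        catalog, _, table = identifier.partition(".")
        return self._externals[catalog][table]

    def create_table(self, identifier: str, location: str) -> None:
        # Writes are not forwarded in the read-only model.
        raise PermissionError("external catalogs are read-only")


fed = ReadOnlyFederatedCatalog()
# Hypothetical external catalog contents.
fed.register("nessie", {"sales.orders": "s3://bucket/orders/metadata/v3.json"})
print(fed.load_table("nessie.sales.orders"))
```

Read/write federation would amount to forwarding `create_table` and similar calls to the owning catalog instead of rejecting them.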
Future Vision: Read/Write Federation
There is active discussion and interest within the community to extend this capability to read/write catalog federation. With this enhancement, users will be able to:
Key Benefits of Federated Catalogs
The Road Ahead
The move toward read/write federation would make it easier for organizations to manage diverse data ecosystems. By bridging the gap between disparate catalogs, Polaris continues to simplify data management and empower users to unlock the full potential of their data.
7. Table Maintenance Service in Apache Polaris
A feature being discussed in the Apache Polaris community is the table maintenance service, designed to streamline table optimization and maintenance workflows. This service would function as a notification system, broadcasting maintenance requests to subscribed tools, enabling automated and efficient table management.
How It Could Work
The table maintenance service would allow users to configure maintenance triggers based on specific conditions. For example, users could set a table to be optimized every 10 snapshots. When this condition is met, the service would broadcast a notification to subscribed tools such as Dremio, Upsolver, and any other service that optimizes Iceberg tables.
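Since the feature is still under discussion, the following Python sketch only illustrates the publish/subscribe idea with the "every 10 snapshots" trigger from the example. The `MaintenanceService` class and its methods are hypothetical, not a Polaris interface.

```python
from typing import Callable


class MaintenanceService:
    """Toy notifier: tell subscribers when a table reaches every Nth snapshot."""

    def __init__(self, every_n_snapshots: int) -> None:
        self.every_n = every_n_snapshots
        self.subscribers: list[Callable[[str], None]] = []
        self.snapshot_counts: dict[str, int] = {}

    def subscribe(self, callback: Callable[[str], None]) -> None:
        """Register a tool that wants to receive maintenance requests."""
        self.subscribers.append(callback)

    def on_snapshot(self, table: str) -> None:
        """Record a new snapshot and broadcast when the trigger condition is met."""
        count = self.snapshot_counts.get(table, 0) + 1
        self.snapshot_counts[table] = count
        if count % self.every_n == 0:
            for notify in self.subscribers:
                notify(table)


service = MaintenanceService(every_n_snapshots=10)
requests: list[str] = []
service.subscribe(requests.append)  # stands in for an optimization tool

for _ in range(25):
    service.on_snapshot("db.orders")

print(requests)  # one request each at snapshots 10 and 20
```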
Key Use Cases
Benefits
8. Catalog Versioning in Apache Polaris
Catalog versioning, a transformative feature currently available in the Nessie catalog, is under discussion for inclusion in the Apache Polaris ecosystem. Adding catalog versioning to Polaris would unlock a range of powerful capabilities, positioning Polaris as a unifying force for the most innovative ideas in the Iceberg catalog space.
The Power of Catalog Versioning
Catalog versioning provides a robust foundation for advanced data management scenarios by enabling:
Proposed Integration with Polaris
Discussions around bringing catalog versioning to Polaris also involve designing a new model that aligns with Polaris’ architecture. This integration could enable:
Potential Impact
If implemented, catalog versioning in Polaris would elevate its capabilities, making it an indispensable tool for organizations looking to modernize their data lakehouse operations.
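As a rough mental model of catalog versioning, think of each branch as a complete mapping from table names to metadata pointers. The Python sketch below assumes that model; the `VersionedCatalog` class, table names, and pointer values are hypothetical and are not Nessie's or Polaris' actual design.

```python
class VersionedCatalog:
    """Toy model: every branch holds its own table -> metadata pointer mapping."""

    def __init__(self) -> None:
        self.branches: dict[str, dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        """Start a new branch from the current state of another branch."""
        self.branches[name] = dict(self.branches[from_branch])

    def commit(self, branch: str, changes: dict[str, str]) -> None:
        """Apply changes to several tables on one branch in a single step."""
        self.branches[branch] = {**self.branches[branch], **changes}

    def merge(self, source: str, target: str = "main") -> None:
        """Publish the source branch's table pointers onto the target branch."""
        self.branches[target] = {**self.branches[target], **self.branches[source]}


catalog = VersionedCatalog()
catalog.commit("main", {"orders": "v1", "customers": "v1"})

# Work in isolation on a branch, touching two tables together.
catalog.create_branch("etl")
catalog.commit("etl", {"orders": "v2", "customers": "v2"})
main_before_merge = dict(catalog.branches["main"])

catalog.merge("etl")
print(main_before_merge, catalog.branches["main"])
```

Readers of `main` see either both tables at `v1` or both at `v2`, never a mix.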
9. Updates to Iceberg’s Delete File Specification
Apache Iceberg’s innovative delete file specification has been central to enabling efficient upserts by managing record deletions with minimal performance overhead. Currently, Iceberg supports two types of delete files:
While these mechanisms are effective, each comes with trade-offs. Position deletes can lead to high I/O costs when reconciling deletions during queries, while equality deletes, though fast to write, impose significant costs during reads and optimizations. Discussions in the Iceberg community propose enhancements to both approaches.
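The read-time cost is easier to see in code. The Python sketch below applies both kinds of deletes while scanning one data file: position deletes name a file path and row position, while equality deletes name column values that must be compared against every row. The data, file path, and `apply_deletes` helper are hypothetical, and details such as sequence numbers are left out.

```python
# Made-up rows of one data file.
rows = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 3, "status": "paid"},
    {"id": 4, "status": "paid"},
]

# Position deletes: (data file path, row position within that file).
position_deletes = {("data/a.parquet", 1)}

# Equality deletes: any row matching all listed column values is deleted.
equality_deletes = [{"id": 4}]


def apply_deletes(path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that survive both kinds of deletes."""
    live = []
    for pos, row in enumerate(rows):
        if (path, pos) in position_deletes:
            continue  # cheap lookup by position
        if any(all(row.get(c) == v for c, v in d.items()) for d in equality_deletes):
            continue  # every row is compared against every equality delete
        live.append(row)
    return live


live_rows = apply_deletes("data/a.parquet", rows, position_deletes, equality_deletes)
print([r["id"] for r in live_rows])
```

The nested comparison in the equality branch is where the read and optimization overhead comes from.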
Proposed Changes to Position Deletes
The key proposal is to transition position deletes from their current file-based storage to deletion vectors within Puffin files. Puffin, a specification for structured metadata storage, allows for compact and efficient storage of additional data.
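A deletion vector is essentially a bitmap with one bit per row position. The Python sketch below uses a plain integer as a stand-in for the compressed bitmap a real implementation would use, and the helper names are hypothetical. It shows why checking a position becomes a constant-time bit test.

```python
def build_deletion_vector(deleted_positions: list[int]) -> int:
    """Set one bit per deleted row position (a plain int stands in for a real bitmap)."""
    bitmap = 0
    for pos in deleted_positions:
        bitmap |= 1 << pos
    return bitmap


def is_deleted(bitmap: int, pos: int) -> bool:
    """Constant-time check of whether a row position is marked deleted."""
    return (bitmap >> pos) & 1 == 1


# Rows 0, 3, and 5 of a six-row data file are deleted.
dv = build_deletion_vector([0, 3, 5])
live_positions = [pos for pos in range(6) if not is_deleted(dv, pos)]
print(live_positions)
```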
Benefits of Storing Deletion Vectors in Puffin Files:
Reimagining Equality Deletes for Streaming
Another area of discussion is rethinking equality deletes to better suit streaming scenarios. The current design prioritizes fast writes but incurs steep costs for reading and optimizing. Possible enhancements include:
Impact of These Changes
10. General Availability of the Dremio Hybrid Catalog
The Dremio Hybrid Catalog, currently in private preview, is set to become generally available sometime in 2025. Built on the foundation of the Polaris catalog, this managed Iceberg catalog is tightly integrated into Dremio, offering a streamlined and feature-rich experience for managing data across cloud and on-prem environments.
Key Features of the Hybrid Catalog
Benefits of the Dremio Hybrid Catalog
Impact on the Iceberg Ecosystem
The general availability of the Dremio Hybrid Catalog will mark a significant milestone for organizations adopting Iceberg. By integrating Polaris’ advanced capabilities into a managed catalog, Dremio is poised to deliver a seamless and efficient solution for managing data lakehouse environments. This innovation underscores Dremio’s commitment to making Iceberg a cornerstone of modern data management strategies.
Conclusion
As we look ahead to 2025, the Apache Iceberg ecosystem is set to deliver groundbreaking advancements that will transform how organizations manage and analyze their data. From enhanced query optimization with scan planning endpoints and materialized views to broader support for geospatial and semi-structured data, Iceberg continues to push the boundaries of data lakehouse capabilities. Exciting developments like the Dremio Hybrid Catalog and updates to delete file specifications promise to make Iceberg even more efficient, scalable, and interoperable.
These innovations highlight the vibrant community driving Apache Iceberg and the collective effort to address the evolving needs of modern data platforms. Whether you’re leveraging Iceberg for its robust cataloging features, seamless multi-cloud support, or cutting-edge query capabilities, 2025 is shaping up to be a year of remarkable growth and opportunity. Stay tuned as Apache Iceberg continues to lead the way in open data lakehouse technology, empowering organizations to unlock the full potential of their data.