Getting Started with Iceberg Tables
Snowflake's support for Apache Iceberg. Iceberg Tables, now generally available, bring Snowflake's easy platform management and great performance to data stored externally in the open source Apache Iceberg format.
Prerequisites
What I did
Either download This file , or create a file named environment.yml with the following contents.
name: iceberg-lab
channels:
- conda-forge
dependencies:
- findspark=2.0.1
- jupyter=1.0.0
- pyspark=3.5.0
- openjdk=11.0.13
To create the environment needed, run the following in your shell.
conda env create -f environment.yml
Setup Snowflake
Create a database, schema, warehouse, role, and user called ICEBERG_LAB in your Snowflake account.
CREATE WAREHOUSE iceberg_lab;
CREATE ROLE iceberg_lab;
CREATE DATABASE iceberg_lab;
CREATE SCHEMA iceberg_lab;
GRANT ALL ON DATABASE iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;
GRANT ALL ON SCHEMA iceberg_lab.iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;;
GRANT ALL ON WAREHOUSE iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;;
CREATE USER iceberg_lab
PASSWORD='<your desired password>',
LOGIN_NAME='ICEBERG_LAB',
MUST_CHANGE_PASSWORD=FALSE,
DISABLED=FALSE,
DEFAULT_WAREHOUSE='ICEBERG_LAB',
DEFAULT_NAMESPACE='ICEBERG_LAB.ICEBERG_LAB',
DEFAULT_ROLE='ICEBERG_LAB';
GRANT ROLE iceberg_lab TO USER iceberg_lab;
GRANT ROLE iceberg_lab TO USER <your username>;
GRANT ROLE accountadmin TO USER iceberg_lab;
3. Create an Iceberg Table
Create an External Volume
Before you create an Iceberg table, you must have an external volume. An external volume is a Snowflake object that stores information about your cloud storage locations and identity and access management (IAM) entities (for example, IAM roles). Snowflake uses an external volume to establish a connection with your cloud storage in order to access Iceberg metadata and Parquet data.
To create an external volume, complete the instructions for your cloud storage service:
Prior to creating the external volume, be sure to use the ACCOUNTADMIN role.
USE ROLE accountadmin;
Name the external volume you create iceberg_lab_vol.
After the external volume is created, use the ACCOUNTADMIN role to grant usage to the ICEBERG_LAB role.
GRANT ALL ON EXTERNAL VOLUME iceberg_lab_vol TO ROLE iceberg_lab WITH GRANT OPTION;
Create a Snowflake-managed Iceberg Table
Iceberg Tables can either use Snowflake, AWS Glue, or object storage as the catalog. In this quickstart, we use Snowflake as the catalog to allow read and write operations to tables. More information about integrating catalogs can be found here .
Create an Iceberg Table referencing the external volume you just created. You can specify BASE_LOCATION to instruct Snowflake where to write table data and metadata, or leave empty to write data and metadata to the location specified in the external volume definition.
USE ROLE iceberg_lab;
USE DATABASE iceberg_lab;
USE SCHEMA iceberg_lab;
CREATE OR REPLACE ICEBERG TABLE customer_iceberg (
c_custkey INTEGER,
c_name STRING,
c_address STRING,
c_nationkey INTEGER,
c_phone STRING,
c_acctbal INTEGER,
c_mktsegment STRING,
c_comment STRING
)
CATALOG='SNOWFLAKE'
EXTERNAL_VOLUME='iceberg_lab_vol'
BASE_LOCATION='';
Load Data
There are multiple ways to load new data into Snowflake-managed Iceberg Tables including INSERT, COPY INTO , and Snowpipe .
For this quickstart, we will INSERT data from the sample tables in your Snowflake account to an Iceberg Table. This will write Parquet files and Iceberg metadata to your external volume.
INSERT INTO customer_iceberg
SELECT * FROM snowflake_sample_data.tpch_sf1.customer;
If you check your cloud storage bucket, you should now see files that Snowflake has written as part of table creation. While Snowflake writes these files automatically, you can also use a function to generate table metadata files that capture any data manipulation language (DML) changes that have been made since the last time Iceberg metadata was generated.
Query and Time Travel
Iceberg Tables are treated much like other tables in Snowflake. For example, you can read different table types in a single query. This query is joining an Iceberg Table with a traditional Snowflake Table.
SELECT
*
FROM customer_iceberg c
INNER JOIN snowflake_sample_data.tpch_sf1.nation n
ON c.c_nationkey = n.n_nationkey;
Benefits of the additional metadata that table formats like Iceberg and Snowflake's provide are, for example, time travel.
Let's first make a simple update to the table. Then, you can see that the row count has increased compared to the previous version of the table.
INSERT INTO customer_iceberg
SELECT
*
FROM snowflake_sample_data.tpch_sf1.customer
LIMIT 5;
SELECT
count(*) AS after_row_count,
before_row_count
FROM customer_iceberg
JOIN (
SELECT count(*) AS before_row_count
FROM customer_iceberg BEFORE(statement => LAST_QUERY_ID())
)
ON 1=1
GROUP BY 2;
Governance and access controls work on Iceberg Tables just like internal tables. As described in the overview section, all of these features require Enterprise or Business Critical Edition of Snowflake.
Row-level Security
Suppose you need to control row-level access to an Iceberg Table for users having different roles. In this example, let's have a role that can see the US customers and one that only sees the non-US customers.
This can be done with a row access policy on the Iceberg Table.
USE ROLE accountadmin;
CREATE ROLE tpch_us;
GRANT ROLE tpch_us TO USER <your username>;
CREATE ROLE tpch_intl;
GRANT ROLE tpch_intl TO USER <your username>;
USE ROLE iceberg_lab;
USE DATABASE iceberg_lab;
USE SCHEMA iceberg_lab;
CREATE OR REPLACE ROW ACCESS POLICY rap_nation
AS (nation_key number) RETURNS BOOLEAN ->
('TPCH_US' = current_role() and nation_key = 24) OR
('TPCH_INTL' = current_role() and nation_key != 24)
;
ALTER ICEBERG TABLE customer_iceberg
ADD ROW ACCESS POLICY rap_nation ON (c_nationkey);
GRANT ALL ON DATABASE iceberg_lab TO ROLE tpch_intl;
GRANT ALL ON SCHEMA iceberg_lab.iceberg_lab TO ROLE tpch_intl;
GRANT ALL ON ICEBERG TABLE iceberg_lab.iceberg_lab.customer_iceberg TO ROLE tpch_intl;
GRANT ALL ON DATABASE iceberg_lab TO ROLE tpch_us;
GRANT ALL ON SCHEMA iceberg_lab.iceberg_lab TO ROLE tpch_us;
GRANT ALL ON ICEBERG TABLE iceberg_lab.iceberg_lab.customer_iceberg TO ROLE tpch_us;
GRANT USAGE ON EXTERNAL VOLUME iceberg_lab_vol TO ROLE tpch_intl;
GRANT USAGE ON EXTERNAL VOLUME iceberg_lab_vol TO ROLE tpch_us;
GRANT USAGE ON WAREHOUSE iceberg_lab TO ROLE tpch_us;
GRANT USAGE ON WAREHOUSE iceberg_lab TO ROLE tpch_intl;
There are two separate roles to grant to Snowflake users, which allow them to see a subset of customers, either international or us.
USE ROLE tpch_intl;
USE WAREHOUSE iceberg_lab;
SELECT
count(*)
FROM iceberg_lab.iceberg_lab.customer_iceberg;
USE ROLE tpch_us;
USE WAREHOUSE iceberg_lab;
SELECT
count(*)
FROM iceberg_lab.iceberg_lab.customer_iceberg;
We want the team of analysts to be able to query the customer table but not see their name(c_name), address (c_address), or phone number(c_phone). To do so, we need to grant them access to all the rows but mask those fields.
We can do that with a masking policy .
USE ROLE accountadmin;
CREATE ROLE tpch_analyst;
GRANT ROLE tpch_analyst TO USER <your username>;
USE ROLE iceberg_lab;
ALTER ROW ACCESS POLICY rap_nation
SET body ->
('TPCH_US' = current_role() and nation_key = 24) or
('TPCH_INTL' = current_role() and nation_key != 24) or
('TPCH_ANALYST' = current_role()) or
('ICEBERG_LAB' = current_role())
;
GRANT ALL ON DATABASE iceberg_lab TO ROLE tpch_analyst;
GRANT ALL ON SCHEMA iceberg_lab.iceberg_lab TO ROLE tpch_analyst;
GRANT ALL ON TABLE iceberg_lab.iceberg_lab.customer_iceberg TO ROLE tpch_analyst;
GRANT USAGE ON WAREHOUSE iceberg_lab TO ROLE tpch_analyst;
GRANT USAGE ON EXTERNAL VOLUME iceberg_lab_vol TO ROLE tpch_analyst;
USE ROLE iceberg_lab;
CREATE OR REPLACE MASKING POLICY pii_mask AS (val string) RETURNS string ->
CASE
WHEN 'TPCH_ANALYST' = current_role() THEN '*********'
ELSE val
END;
ALTER ICEBERG TABLE customer_iceberg MODIFY COLUMN c_name SET MASKING POLICY pii_mask;
ALTER ICEBERG TABLE customer_iceberg MODIFY COLUMN c_address SET MASKING POLICY pii_mask;
ALTER ICEBERG TABLE customer_iceberg MODIFY COLUMN c_phone SET MASKING POLICY pii_mask;
USE ROLE tpch_analyst;
SELECT
*
FROM customer_iceberg;
Other governance features can be applied to Iceberg Tables, including object tagging , and tag-based masking .
Monitor Governance in Snowsight
As a data administrator, you can use the built-in Dashboard and Tagged Objects interfaces to monitor and report on the usage of policies and tags with tables, views, and columns. This includes policies and tags applied to Iceberg Tables.
Using the ACCOUNTADMIN role, or an account role that is granted the GOVERNANCE_VIEWER and OBJECT_VIEWER database roles, click Data ? Governance to navigate to these interfaces. You can see the policies applied to the Iceberg Table.
5. Processing Iceberg Tables with Snowpark
Snowpark is a set of libraries and runtimes in Snowflake that securely deploy and process non-SQL code, including Python, Java and Scala. Snowpark can be used for data science and data engineering pipelines, with key benefits including:
You can interact with Iceberg Tables using DataFrames that are lazily executed. Let's try this by first creating an empty Iceberg table.
USE ROLE iceberg_lab;
USE DATABASE iceberg_lab;
USE SCHEMA iceberg_lab;
CREATE OR REPLACE ICEBERG TABLE nation_orders_iceberg (
regionkey INTEGER,
nationkey INTEGER,
nation STRING,
custkey INTEGER,
order_count INTEGER,
total_price INTEGER
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_lab_vol'
BASE_LOCATION = '';
Now create a new Python worksheet. Ensure ICEBERG_LAB role and warehouse are selected in the top-right, and ICEBERG_LAB database and schema are selected in your worksheet. After running the worksheet, you will see the data that was saved to the NATION_ORDERS_ICEBERG table.
import snowflake.snowpark as snowpark
from snowflake.snowpark import functions as sf
def main(session: snowpark.Session):
# Create a DataFrame representing the 'orders' table
df_orders = session.read.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS")
# Perform aggregation on the DataFrame
df_orders_agg = (
df_orders
.groupBy(df_orders.o_custkey)
.agg(
sf.count(df_orders.o_orderkey).alias("order_count"),
sf.sum(df_orders.o_totalprice).alias("total_price")
)
)
df_orders_agg = df_orders_agg.select("o_custkey", "order_count", "total_price")
df_customer = session.read.table("ICEBERG_LAB.ICEBERG_LAB.CUSTOMER_ICEBERG")
df_nation = session.read.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION")
df_nation_customer = df_customer.join(df_nation, df_customer.col("c_nationkey") == df_nation.col("n_nationkey")).select("c_custkey", df_nation["n_nationkey"].as_("nationkey"), df_nation["n_name"].as_("nation"), df_nation["n_regionkey"].as_("regionkey"))
df_nation_customer_orders_agg = df_nation_customer.join(df_orders_agg, df_nation_customer.col("c_custkey") == df_orders_agg.col("o_custkey")).select("regionkey", "nationkey", "nation", df_nation_customer["c_custkey"].as_("custkey"), "order_count", "total_price")
df_nation_customer_orders_agg = df_nation_customer_orders_agg.select("regionkey", "nationkey", "nation", "custkey", "order_count", "total_price")
# Save result to iceberg table
df_nation_customer_orders_agg.write.mode("append").save_as_table("nation_orders_iceberg")
return df_nation_customer_orders_agg
6. Sharing Iceberg Tables
Iceberg Tables can be securely shared with consumers either through their own Snowflake account or a provisioned Snowflake Reader account. The consumer can be an external entity or a different internal business unit that is required to have its own unique Snowflake account.
Cross-cloud and cross-region sharing of Iceberg Tables is not currently supported. The provider's external volume, Snowflake account, and consumer's Snowflake account must all be in the same cloud region.
With data sharing, including Iceberg Tables:
Suppose you have a partner who wants to analyze the data in your ICEBERG_LAB database on a near real-time basis. This partner also has their own Snowflake account in the same region as our account. Data sharing is an easy, secure solution to allow them to access this information.
Creating a Reader Account
For the purposes of this lab, we'll share data with a provisioned reader account. Return to your SQL worksheet, and grant the ICEBERG_LAB role the ability to create a reader account.
USE ROLE accountadmin;
GRANT CREATE ACCOUNT ON ACCOUNT TO ROLE iceberg_lab;
USE ROLE ICEBERG_LAB;
Exit your SQL worksheet and navigate to Private Sharing, then click the tab Reader Accounts near the top of your window, then click + New. Use ICEBERG_LAB_READER as the Account Name, READER_ADMIN as the User Name, and provide a password. Then click Create Account. You'll see the reader account now listed.
Create a Secure View
Now create a secure view which is what will eventually be shared with the ICEBERG_LAB_READER account.
USE ROLE iceberg_lab;
CREATE OR REPLACE SECURE VIEW nation_orders_v AS
SELECT
nation,
SUM(order_count) as order_count,
SUM(total_price) as total_price
FROM nation_orders_iceberg
GROUP BY nation;
Create an Outbound Share
By default, ACCOUNTADMIN is the only role that can create shares. From your SQL worksheet, grant the ICEBERG_LAB role to create a share, then use this role.
USE ROLE accountadmin;
GRANT CREATE SHARE ON ACCOUNT TO ROLE iceberg_lab;
USE ROLE iceberg_lab;
Exit the SQL worksheet and navigate to Data ? Private Sharing, then click on the Shared by My Account tab near the top of your window, then click the Share button in the top-right corner and select Create a Direct Share.
Accessing Shared Data
In a separate browser tab, login to the reader account previously created. After logging in, as this is a new account, create a new SQL worksheet..
USE ROLE accountadmin;
CREATE OR REPLACE WAREHOUSE iceberg_lab_reader
WAREHOUSE_SIZE = XSMALL
AUTO_SUSPEND = 1
AUTO_RESUME = 1
INITIALLY_SUSPENDED = TRUE;
Now let's view the shared data. Exit the worksheet, and in the pane on the left, click on on Data ? Private Sharing. You will see ICEBERG_LAB_NATION_ORDERS_SHARED_DATA listed under Direct Shares. Select Get Data and name it READER_ICEBERG_LAB_NATION_ORDERS_SHARED_DATA. Make it available to the PUBLIC role, then click Get Data, then click View Database.
You can now query the shared data, which is a view on top of an Iceberg table. Run the queries below in a SQL worksheet to create a warehouse and see which countries had the most orders.
SELECT *
FROM reader_iceberg_lab_nation_orders_shared_data.iceberg_lab.nation_orders_v
ORDER BY order_count DESC;
7. Access Iceberg Tables from Apache Spark
Suppose another team that uses Spark wants to read the Snowflake-managed Iceberg Table using their Spark clusters. They can use the Snowflake Iceberg Catalog SDK to access snapshot information, and directly access data and metadata in object storage, all without using any Snowflake warehouses.
From your terminal, run the following commands to activate the virtual environment you created in the setup, and open jupyter notebooks.
conda activate iceberg-lab
jupyter notebook
Download the notebook iceberg_lab.ipynb provided here , then open from Jupyter. Update and run the cells that are applicable to the cloud in which your Snowflake account is located.
8. Cleanup
To delete all of the objects created in this guide, you can drop the user, role, database, and warehouse.
USE ROLE iceberg_lab;
DROP SHARE iceberg_lab_nation_orders_shared_data;
DROP DATABASE iceberg_lab;
USE ROLE accountadmin;
DROP EXTERNAL VOLUME iceberg_lab_vol;
DROP USER iceberg_lab;
DROP ROLE iceberg_lab;
DROP ROLE tpch_us;
DROP ROLE tpch_intl;
DROP ROLE tpch_analyst;
DROP WAREHOUSE iceberg_lab;
To delete the Conda environment, run the following in your shell.
conda deactivate
conda remove -n iceberg-lab --all
9. Conclusion
Snowflake’s support for Apache Iceberg tables enables enhanced data management and governance, combining Snowflake’s performance benefits with the open-source flexibility of Iceberg. This integration allows users to store data externally in Iceberg format, apply governance policies, and leverage Snowpark for streamlined data pipelines. The platform also supports secure data sharing and efficient cross-functional access, along with robust options for managing Iceberg tables using Apache Spark. With these capabilities, Snowflake facilitates efficient, scalable, and secure data handling for modern data engineering needs.