Resources for Learning more about Catalog level versioning with Project Nessie & Dremio Arctic (Rollbacks, Branching, Tagging and Multi-Table Txns)

Alex Merced

Co-Author of “Apache Iceberg: The Definitive Guide” | Head of DevRel at Dremio | LinkedIn Learning Instructor | Tech Content Creator

Published: May 10, 2023

Data quality, governance, observability, and disaster recovery are areas where best practices are still emerging in the world of the data lakehouse. A rising trend borrows the practices software developers use to manage these same issues in code bases. This trend is called "Data as Code". The practices it aims to bring to the lakehouse include:

Versioning to enable isolating work on branches or marking particular reproducible states through tagging
Commits to enable time travel and rollbacks
The ability to use branching and merging to make atomic changes to multiple objects at the same time (in data terms, multi-table transactions)
Capturing data in commits to build auditability of who is making what changes and when
The ability to govern who can access what
Automating the integration of changes (Continuous Integration) and automating the publishing of those changes (Continuous Deployment) via CI/CD pipelines

Project Nessie

Several solutions are emerging that approach these problems at different layers, such as the catalog, table, and file levels. Project Nessie is an open-source project that addresses them at the catalog level. Benefits of Nessie's particular approach:

Isolate ingestion across your entire catalog by branching it, allowing you to audit and inspect data before publishing without exposing it to consumers and without having to make a "staging" copy of the data (branches do not create data copies, just as Git branches don't duplicate your code).
Make changes to multiple tables from a branch, then merge those changes as one atomic multi-table transaction.
If a job fails or works in unintended ways, instead of rolling back several tables individually, you can roll back all your tables by rolling back your catalog.
Manage access to the catalog, limiting which branches/tables a user can access and what kind of operations they can run on them.
Commit logs can be used as an audit log, giving you visibility into your catalog updates.
Nessie operations can all be done via SQL, making it more accessible to data consumers.
Your tables are portable, since they can be accessed by any tool with Nessie support, such as Apache Spark, Apache Flink, Dremio, Presto, and more.
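
To make the ideas of branching without copies, atomic multi-table merges, and catalog-wide rollback more concrete, here is a deliberately simplified toy model in Python. It is not how Nessie is implemented; the ToyCatalog and Commit classes, the table names, and the metadata file names are all hypothetical and exist only for illustration. The sketch assumes that a catalog commit is just a mapping from table names to metadata pointers, and that a merge is a plain fast-forward (the target branch has not changed since the branch was created).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """One catalog state: table name -> metadata pointer (never the data itself)."""
    tables: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical stand-in for a versioned catalog; NOT Nessie's real implementation."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit({}, "init")}

    def create_branch(self, name: str, source: str = "main") -> None:
        # A branch is only a new name for an existing commit, so nothing is copied.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> None:
        # One commit can repoint several tables at once.
        head = self.branches[branch]
        self.branches[branch] = Commit({**head.tables, **changes}, message, head)

    def merge(self, source: str, target: str) -> None:
        # Simplified fast-forward: all table changes become visible in one step.
        self.branches[target] = self.branches[source]

    def rollback(self, branch: str) -> None:
        # Moving the branch back one commit restores every table together.
        parent = self.branches[branch].parent
        if parent is None:
            raise ValueError(f"branch '{branch}' has no earlier commit")
        self.branches[branch] = parent


catalog = ToyCatalog()
catalog.create_branch("etl")
catalog.commit(
    "etl",
    {"orders": "orders-v2.metadata.json", "customers": "customers-v2.metadata.json"},
    "nightly load",
)
hidden_from_main = "orders" not in catalog.branches["main"].tables  # consumers see nothing yet
catalog.merge("etl", "main")   # both tables appear on main at the same time
catalog.rollback("main")       # both tables are rolled back at the same time
```

The point of the model is that every operation only moves pointers in the catalog, which is why a branch costs no extra storage and why a single rollback covers all tables.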

Project Nessie Resources

Tutorials:

Dremio Arctic

While you can deploy your own Nessie server, you can have a cloud-managed one with some extra features using the Dremio Arctic service. Beyond the amazing catalog-level versioning features that you get with having a Nessie catalog for your tables, Dremio Arctic also provides:

Automatic table optimization services
An easy and intuitive UI to view commit logs, manage branches, and more
Easy integration with the Dremio Sonar Lakehouse query engine
Zero cost to get a catalog up and running in moments with a Dremio Cloud account

Dremio Arctic Resources

CI/CD

Essentially, you can create automated pipelines that take advantage of Nessie's branching using any tool that supports Nessie, for example:

Orchestration tools
Cron jobs
Serverless functions

These mechanisms can be used to send instructions to Nessie-supporting tools like Dremio and Apache Spark. For example:

Data lands on S3, triggering a Python script that sends the appropriate SQL queries to Dremio via Arrow Flight, ODBC, or REST
A PySpark script that runs on a schedule, sending instructions to Spark

These jobs would all follow a similar pattern:

Create a branch
Switch to the branch
Make updates
Validate updates
If validations are successful, merge the changes
If validations fail, generate an error with details for remediation (consumers are never exposed to inconsistent or incorrect data)
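
The pattern above can be sketched as a small Python function. This is a minimal sketch under stated assumptions, not a definitive implementation: run_branch_job, its execute and validate parameters, the catalog name nessie, and the table names are all hypothetical. The SQL strings follow the branch syntax of the Nessie Spark SQL extensions as I understand it (CREATE BRANCH, USE REFERENCE, MERGE BRANCH); other engines such as Dremio may use slightly different statements, so check the documentation for your tool.

```python
from typing import Callable, Iterable, List


def run_branch_job(
    execute: Callable[[str], object],   # hypothetical: sends one SQL statement to the engine
    validate: Callable[[], bool],       # hypothetical: returns True if the branch data looks right
    updates: Iterable[str],
    branch: str,
    catalog: str = "nessie",            # assumed catalog name
    main: str = "main",
) -> None:
    """Apply updates on an isolated branch and merge them only if validation passes."""
    # 1. Create a branch
    execute(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {main}")
    # 2. Switch to the branch
    execute(f"USE REFERENCE {branch} IN {catalog}")
    # 3. Make updates (they are only visible on the branch)
    for statement in updates:
        execute(statement)
    # 4. Validate updates
    if not validate():
        # 6. On failure, raise with details; nothing was merged, so main is untouched
        raise RuntimeError(
            f"Validation failed on branch '{branch}'; '{main}' was not changed."
        )
    # 5. On success, merge all changes in one step
    execute(f"MERGE BRANCH {branch} INTO {main} IN {catalog}")


# Dry run with a fake executor that only records the statements it receives.
sent: List[str] = []
run_branch_job(
    execute=sent.append,
    validate=lambda: True,
    updates=["INSERT INTO nessie.sales SELECT * FROM nessie.sales_staging"],
    branch="etl_2023_05_10",
)
```

With Spark, execute could be something like lambda q: spark.sql(q) on a session configured with a Nessie catalog, and validate could run row-count or data-quality checks against the branch. Keeping both as plain callables lets the same function be driven from an orchestration tool, a cron job, or a serverless function.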

Resources on CI/CD with Arctic/Nessie

DataOps in action with Nessie, Iceberg and Great Expectations
Subsurface LIVE 23 | CI CD on the Lakehouse - Making Data Changes and Repair Safe and Easy
