Project Nessie

Project Nessie is an open-source transactional catalog for data lakes, built to provide Git-like semantics for data version control, branching, and reproducibility across various data lake storage layers. It is designed to work with modern table formats such as Apache Iceberg and Delta Lake, as well as Hive tables, and offers a unified, consistent interface for managing datasets from engines like Apache Spark, Flink, Presto, and Trino.


What is Project Nessie?

Project Nessie is a versioned catalog that brings Git-like semantics (e.g., branches, commits, merges) to data lakes. It helps manage data across different engines and storage layers, ensuring consistency, reproducibility, and isolation.

  • Data Version Control: Track changes to datasets with commit history.
  • Branching and Merging: Enable parallel workstreams with isolated environments.
  • Transactional Consistency: Ensure ACID transactions across data lakes.
  • Engine Agnostic: Integrate with popular engines like Spark, Flink, Trino, and Presto.
  • Simplified Data Governance: A centralized catalog that makes auditing and governance easier.

Nessie acts as a metadata management layer, providing a versioned view of your data. It stores metadata pointers for datasets while the actual data remains in cloud object stores (e.g., AWS S3, Azure Data Lake Storage); a configuration sketch showing this split follows the list below.

  1. Cataloging: Maintains a global namespace for datasets.
  2. Versioning: Tracks commits with metadata pointers, like Git commits.
  3. Branching: Create isolated branches for experiments without affecting production data.
  4. Merging: Merge changes from one branch into another.
  5. Tagging: Tag specific commits for reproducibility.
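
A minimal sketch of how this split looks in practice, assuming a Nessie server at http://localhost:19120 and an S3 bucket you control, with the appropriate Iceberg and Nessie jars on the classpath (the bucket, namespace, and table names below are placeholders):

  # Sketch: Spark + Iceberg with Nessie as the versioned catalog.
  # Assumes a Nessie server at http://localhost:19120 and an S3 bucket;
  # the bucket, namespace, and table names are placeholders.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("nessie-demo")
      # Enable Nessie's SQL extensions (branch/merge/tag statements).
      .config("spark.sql.extensions",
              "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
              "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
      # Register an Iceberg catalog named "nessie" backed by Nessie.
      .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.nessie.catalog-impl",
              "org.apache.iceberg.nessie.NessieCatalog")
      .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v2")
      .config("spark.sql.catalog.nessie.ref", "main")  # default branch
      .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
      .getOrCreate()
  )

  # The table's data and Iceberg metadata files land in S3; Nessie only
  # records the versioned pointer to the current table metadata.
  spark.sql("CREATE TABLE nessie.db.orders (id BIGINT, amount DOUBLE) USING iceberg")

The later sketches in this article reuse this spark session.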


Core Features

1. Git-Like Operations

  • commit: Save a snapshot of the dataset.
  • branch: Create isolated branches (e.g., dev, prod, test).
  • merge: Merge changes from one branch to another.
  • tag: Mark specific points in time.
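
In Spark, these operations map onto Nessie's SQL extensions. A sketch, reusing the session configured earlier (statement syntax can vary across Nessie versions):

  # Git-like operations through Nessie's Spark SQL extensions.
  spark.sql("CREATE BRANCH IF NOT EXISTS dev IN nessie FROM main")
  spark.sql("USE REFERENCE dev IN nessie")                    # switch to the branch
  spark.sql("INSERT INTO nessie.db.orders VALUES (1, 9.99)")  # a commit on dev
  spark.sql("MERGE BRANCH dev INTO main IN nessie")           # publish to main
  spark.sql("CREATE TAG v1 IN nessie FROM main")              # mark a point in time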

2. Transactional Consistency

  • ACID transactions ensure correctness across multiple engines.
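
One way to picture this, as a sketch building on the session above (the order_audit table is hypothetical): both writes below become visible on main in a single merge, so readers never observe one table updated without the other.

  # Sketch: an atomic multi-table change via branch-then-merge.
  spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
  spark.sql("USE REFERENCE etl IN nessie")
  spark.sql("INSERT INTO nessie.db.orders VALUES (2, 19.99)")
  spark.sql("INSERT INTO nessie.db.order_audit VALUES (2, 'order added')")  # hypothetical table
  spark.sql("MERGE BRANCH etl INTO main IN nessie")  # both changes land together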

3. Isolation & Reproducibility

  • Branch-based isolation for testing and experimentation.
  • Reproduce historical views of data with commit IDs.
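
For example, a past state can be pinned to a commit hash. A sketch, again reusing the session above (the hash is a placeholder, and the AT clause syntax may differ between Nessie versions):

  # Sketch: reproduce a historical view of the data by commit hash.
  spark.sql("SHOW LOG main IN nessie").show()  # list commits and their hashes
  spark.sql("USE REFERENCE main AT abc123def456 IN nessie")  # placeholder hash
  spark.sql("SELECT * FROM nessie.db.orders").show()  # data as of that commit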

4. Engine Integration

Works with:

  • Apache Iceberg
  • Delta Lake
  • Apache Hive
  • Apache Spark
  • Trino / Presto

5. Data Governance & Auditing

  • Tracks commit metadata (who changed what, and when) to build audit trails.
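
A sketch of what that audit trail looks like from the outside, assuming a Nessie server at localhost:19120 (the /trees/{ref}/history path follows the v2 REST API layout and is an assumption here):

  # Sketch: read the commit log for auditing over Nessie's REST API.
  import requests

  resp = requests.get("http://localhost:19120/api/v2/trees/main/history")
  resp.raise_for_status()
  for entry in resp.json().get("logEntries", []):
      meta = entry.get("commitMeta", {})
      print(meta.get("hash"), meta.get("author"), meta.get("message"))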


Architectural Overview

Core Components:

  1. Metadata Store: Tracks schema, partitions, and commit history.
  2. Storage Backend: Cloud object stores like AWS S3, Azure ADLS, Google Cloud Storage.
  3. Catalog API: Provides APIs for data access and operations.
  4. Clients: Integrations for Spark, Flink, Presto, and Trino.
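
To see the Catalog API component in isolation, here is a sketch using nothing but an HTTP client (assuming the v2 REST API at localhost:19120; the field names are based on that layout and may vary):

  # Sketch: list branches and tags straight from the Catalog API.
  import requests

  refs = requests.get("http://localhost:19120/api/v2/trees").json()
  for ref in refs.get("references", []):
      print(ref.get("type"), ref.get("name"), ref.get("hash"))

Engines such as Spark or Trino talk to this same API through their Nessie clients rather than raw HTTP.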
