Project Nessie
SHOAIB SHAIK
Published author, AI and deep learning; fascinated by technology, with a deep passion for using it to make the world a better place. Time, space, and quantum computing are topics I am happy to be involved in.
Project Nessie is an open-source transactional catalog for data lakes, built to provide Git-like semantics for data version control, branching, and reproducibility across various data lake storage layers. It is designed to work with modern table formats like Apache Iceberg, Delta Lake, and Hive tables, offering a unified and consistent interface for managing datasets from engines like Apache Spark, Flink, Presto, and Trino.
What is Project Nessie?
Project Nessie is a versioned data lake catalog that introduces Git-like semantics (e.g., branches, commits, merges) to data lakes. It helps manage data across different engines and storage layers, ensuring consistency, reproducibility, and isolation.
Nessie acts as a metadata management layer, providing a versioned view of your data. It stores metadata pointers for datasets while the actual data remains in cloud object stores (e.g., AWS S3, Azure Data Lake).
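To make the idea of "versioned metadata pointers" concrete, here is a minimal in-memory sketch of Git-like catalog semantics. It is a toy model, not Nessie's API or implementation: the `ToyCatalog` class, its method names, and the `s3://` paths are all hypothetical, and the merge shown simply lets the source branch's pointers win, with no conflict detection.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot mapping table names to metadata file locations."""
    tables: Dict[str, str]
    parent: Optional[str]
    message: str

    @property
    def id(self) -> str:
        # Content-derived identifier, loosely similar to a Git commit hash.
        payload = json.dumps([sorted(self.tables.items()), self.parent, self.message])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class ToyCatalog:
    """Hypothetical toy catalog: branches are named pointers to commits."""

    def __init__(self) -> None:
        root = Commit(tables={}, parent=None, message="root")
        self.commits: Dict[str, Commit] = {root.id: root}
        self.branches: Dict[str, str] = {"main": root.id}

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch copies only a pointer; no data files are duplicated.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, table: str, pointer: str, message: str) -> str:
        # Record a new snapshot in which `table` points at a new metadata file.
        head = self.commits[self.branches[branch]]
        new = Commit({**head.tables, table: pointer}, head.id, message)
        self.commits[new.id] = new
        self.branches[branch] = new.id
        return new.id

    def tables(self, ref: str) -> Dict[str, str]:
        # `ref` may be a branch name or a commit id (for reproducible reads).
        commit_id = self.branches.get(ref, ref)
        return dict(self.commits[commit_id].tables)

    def merge(self, source: str, target: str) -> str:
        # Simplified merge: the source branch's pointers overwrite the target's.
        src = self.commits[self.branches[source]]
        tgt = self.commits[self.branches[target]]
        merged = Commit({**tgt.tables, **src.tables}, tgt.id,
                        f"merge {source} into {target}")
        self.commits[merged.id] = merged
        self.branches[target] = merged.id
        return merged.id


catalog = ToyCatalog()
catalog.create_branch("etl")
catalog.commit("etl", "sales", "s3://bucket/sales/v2.metadata.json", "load sales")
print(catalog.tables("main"))  # work on "etl" is isolated from "main"
catalog.merge("etl", "main")
print(catalog.tables("main"))  # after the merge, "main" sees the new pointer
```

The point of the sketch is that branching and merging touch only small pointer records, while the data files in object storage stay where they are. Reading by commit id rather than branch name is what gives a reproducible view of the catalog at a point in time.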
Core Features
1. Git-Like Operations
2. Transactional Consistency
3. Isolation & Reproducibility
4. Engine Integration
Works with:
5. Data Governance & Auditing
Architectural Overview
Core Components: