Project Nessie

Project Nessie is an open-source transactional catalog for data lakes, built to provide Git-like semantics for data version control, branching, and reproducibility across various data lake storage layers. It is designed to work with modern table formats such as Apache Iceberg and Delta Lake, as well as Hive tables, and offers a unified, consistent interface for managing datasets from engines like Apache Spark, Flink, Presto, and Trino.


What is Project Nessie?

Project Nessie is a versioned catalog that brings Git-like semantics (e.g., branches, commits, merges) to data lakes. It helps manage data across different engines and storage layers, ensuring consistency, reproducibility, and isolation.

  • Data Version Control: Track changes to datasets with commit history.
  • Branching and Merging: Enable parallel workstreams with isolated environments.
  • Transactional Consistency: Ensure ACID transactions across data lakes.
  • Engine Agnostic: Integrate with popular engines like Spark, Flink, Trino, and Presto.
  • Simplified Data Governance: A centralized catalog that makes auditing and governance easier.

Nessie acts as a metadata management layer, providing a versioned view of your data. It stores metadata pointers for datasets while the actual data remains in cloud object stores (e.g., AWS S3, Azure Data Lake Storage); a configuration sketch showing this split follows the list below.

  1. Cataloging: Maintains a global namespace for datasets.
  2. Versioning: Tracks commits with metadata pointers, like Git commits.
  3. Branching: Create isolated branches for experiments without affecting production data.
  4. Merging: Merge changes from one branch into another.
  5. Tagging: Tag specific commits for reproducibility.
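
A minimal sketch of how this split looks in practice, assuming a Nessie server at http://localhost:19120 and an S3 bucket you control, with the appropriate Iceberg and Nessie jars on the classpath (the bucket, namespace, and table names below are placeholders):

  # Sketch: Spark + Iceberg with Nessie as the versioned catalog.
  # Assumes a Nessie server at http://localhost:19120 and an S3 bucket;
  # the bucket, namespace, and table names are placeholders.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("nessie-demo")
      # Enable Nessie's SQL extensions (branch/merge/tag statements).
      .config("spark.sql.extensions",
              "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
              "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
      # Register an Iceberg catalog named "nessie" backed by Nessie.
      .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.nessie.catalog-impl",
              "org.apache.iceberg.nessie.NessieCatalog")
      .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v2")
      .config("spark.sql.catalog.nessie.ref", "main")  # default branch
      .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
      .getOrCreate()
  )

  # The table's data and Iceberg metadata files land in S3; Nessie only
  # records the versioned pointer to the current table metadata.
  spark.sql("CREATE TABLE nessie.db.orders (id BIGINT, amount DOUBLE) USING iceberg")

The later sketches in this article reuse this spark session.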


Core Features

1. Git-Like Operations

  • commit: Save a snapshot of the dataset.
  • branch: Create isolated branches (e.g., dev, prod, test).
  • merge: Merge changes from one branch to another.
  • tag: Mark specific points in time.
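
In Spark, these operations map onto Nessie's SQL extensions. A sketch, reusing the session configured earlier (statement syntax can vary across Nessie versions):

  # Git-like operations through Nessie's Spark SQL extensions.
  spark.sql("CREATE BRANCH IF NOT EXISTS dev IN nessie FROM main")
  spark.sql("USE REFERENCE dev IN nessie")                    # switch to the branch
  spark.sql("INSERT INTO nessie.db.orders VALUES (1, 9.99)")  # a commit on dev
  spark.sql("MERGE BRANCH dev INTO main IN nessie")           # publish to main
  spark.sql("CREATE TAG v1 IN nessie FROM main")              # mark a point in time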

2. Transactional Consistency

  • ACID transactions ensure correctness across multiple engines.
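
One way to picture this, as a sketch building on the session above (the order_audit table is hypothetical): both writes below become visible on main in a single merge, so readers never observe one table updated without the other.

  # Sketch: an atomic multi-table change via branch-then-merge.
  spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
  spark.sql("USE REFERENCE etl IN nessie")
  spark.sql("INSERT INTO nessie.db.orders VALUES (2, 19.99)")
  spark.sql("INSERT INTO nessie.db.order_audit VALUES (2, 'order added')")  # hypothetical table
  spark.sql("MERGE BRANCH etl INTO main IN nessie")  # both changes land together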

3. Isolation & Reproducibility

  • Branch-based isolation for testing and experimentation.
  • Reproduce historical views of data with commit IDs.
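
For example, a past state can be pinned to a commit hash. A sketch, again reusing the session above (the hash is a placeholder, and the AT clause syntax may differ between Nessie versions):

  # Sketch: reproduce a historical view of the data by commit hash.
  spark.sql("SHOW LOG main IN nessie").show()  # list commits and their hashes
  spark.sql("USE REFERENCE main AT abc123def456 IN nessie")  # placeholder hash
  spark.sql("SELECT * FROM nessie.db.orders").show()  # data as of that commit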

4. Engine Integration

Works with:

  • Apache Iceberg
  • Delta Lake
  • Apache Hive
  • Apache Spark
  • Trino / Presto

5. Data Governance & Auditing

  • Tracks commit metadata (who changed what, and when) to build audit trails.
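
A sketch of what that audit trail looks like from the outside, assuming a Nessie server at localhost:19120 (the /trees/{ref}/history path follows the v2 REST API layout and is an assumption here):

  # Sketch: read the commit log for auditing over Nessie's REST API.
  import requests

  resp = requests.get("http://localhost:19120/api/v2/trees/main/history")
  resp.raise_for_status()
  for entry in resp.json().get("logEntries", []):
      meta = entry.get("commitMeta", {})
      print(meta.get("hash"), meta.get("author"), meta.get("message"))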


Architectural Overview

Core Components:

  1. Metadata Store: Tracks schema, partitions, and commit history.
  2. Storage Backend: Cloud object stores like AWS S3, Azure ADLS, Google Cloud Storage.
  3. Catalog API: Provides APIs for data access and operations.
  4. Clients: Integrations for Spark, Flink, Presto, and Trino.
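
To see the Catalog API component in isolation, here is a sketch using nothing but an HTTP client (assuming the v2 REST API at localhost:19120; the field names are based on that layout and may vary):

  # Sketch: list branches and tags straight from the Catalog API.
  import requests

  refs = requests.get("http://localhost:19120/api/v2/trees").json()
  for ref in refs.get("references", []):
      print(ref.get("type"), ref.get("name"), ref.get("hash"))

Engines such as Spark or Trino talk to this same API through their Nessie clients rather than raw HTTP.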
