Data Vault

A data vault is a data modeling design pattern used to organize and store data in data warehouses, lakehouses, and data meshes for enterprise-scale analytics. It is a flexible, scalable, and agile approach to managing and integrating large amounts of data from different sources. A data vault model has three types of entities: hubs, links, and satellites.

Data vault modeling: Hubs, links, and satellites

  • Hubs - Each hub represents a core business concept, such as a customer, a product, or a vehicle, identified by a business key such as a Customer ID, Product Number, or vehicle identification number (VIN). Users look up information about a hub through its business key. Alongside the business key, a hub row may carry a sequence ID, a load date, and other metadata.
  • Links - Links represent the relationships between hubs.
  • Satellites - Satellites supply the descriptive information that hubs and links do not hold themselves. They store the attributes that belong to a hub or to a relationship between hubs.

A few additional things to keep in mind:

  • A satellite cannot have a direct connection to another satellite.
  • A hub or link may have one or more satellites.
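
The three entity types and the rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class names, the field names, the `hash_key` helper, and the choice of SHA-256 for deriving keys are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def hash_key(*business_keys: str) -> str:
    """Hypothetical helper: derive a deterministic key from business keys."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class HubCustomer:
    customer_hk: str       # key derived from the business key
    customer_id: str       # the business key itself
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class HubVehicle:
    vehicle_hk: str
    vin: str               # business key
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class LinkCustomerVehicle:
    link_hk: str
    customer_hk: str       # points at HubCustomer
    vehicle_hk: str        # points at HubVehicle
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class SatCustomerDetails:
    customer_hk: str       # parent is a hub (or link), never another satellite
    load_date: datetime
    name: str
    city: str
    record_source: str


# Illustrative values only.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
customer = HubCustomer(hash_key("C-1001"), "C-1001", now, "crm")
vehicle = HubVehicle(hash_key("1HGCM82633A004352"), "1HGCM82633A004352", now, "dealer_system")
ownership = LinkCustomerVehicle(
    hash_key("C-1001", "1HGCM82633A004352"),
    customer.customer_hk,
    vehicle.vehicle_hk,
    now,
    "dealer_system",
)
details = SatCustomerDetails(customer.customer_hk, now, "Ada Example", "Pune", "crm")
```

The hubs hold only keys and load metadata, the link holds only references to hubs, and every descriptive attribute lives in the satellite, which refers back to exactly one parent.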

Here are some characteristics of a data vault:

  • Hybrid approach: Combines the strengths of third normal form (3NF) and star schema
  • Granular: Captures data in its most granular form, ensuring that every piece of information is stored without loss of detail or context
  • Historical tracking: A historical repository of enterprise data
  • Auditable: A non-volatile, auditable repository of enterprise data
  • Well-suited to dynamic industries: Enables businesses to adapt quickly to changes in their data environment
  • Improves data quality: Enhances data quality and usability across the organization
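
One way to picture the historical and non-volatile characteristics is an insert-only satellite: rows are appended when attributes change and are never updated in place. The sketch below is an assumption-laden illustration; the `hash_diff`, `load_satellite`, and `as_of` functions, the row layout, and the in-memory list standing in for a table are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def hash_diff(attributes: dict) -> str:
    """Fingerprint the descriptive attributes so changes can be detected."""
    payload = "||".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list, hub_key: str, attributes: dict, load_date: datetime) -> bool:
    """Append a row only if the attributes differ from the latest row for this key.

    Existing rows are never modified. Returns True when a row was appended.
    """
    new_diff = hash_diff(attributes)
    history = [row for row in satellite if row["hub_key"] == hub_key]
    if history:
        latest = max(history, key=lambda row: row["load_date"])
        if latest["hash_diff"] == new_diff:
            return False
    satellite.append(
        {"hub_key": hub_key, "load_date": load_date, "hash_diff": new_diff, **attributes}
    )
    return True


def as_of(satellite: list, hub_key: str, moment: datetime):
    """Return the row that was current for hub_key at the given moment, or None."""
    rows = [r for r in satellite if r["hub_key"] == hub_key and r["load_date"] <= moment]
    return max(rows, key=lambda r: r["load_date"]) if rows else None


# Illustrative load sequence with made-up values.
sat_customer = []
jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
mar = datetime(2024, 3, 1, tzinfo=timezone.utc)

load_satellite(sat_customer, "cust-1", {"city": "Pune"}, jan)    # first version
load_satellite(sat_customer, "cust-1", {"city": "Pune"}, feb)    # unchanged, skipped
load_satellite(sat_customer, "cust-1", {"city": "Mumbai"}, mar)  # changed, appended
```

Because earlier rows stay in place, `as_of` can answer what the data looked like at any past load date, which is the property that auditing relies on.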

Data vault benefits

  • Agile
  • Structured, with flexibility for refactoring
  • Extremely scalable, up to petabyte (PB) volumes
  • Uses patterns that support ETL code generation
  • Familiar architecture: data layers, ETL, star schemas

A Data Vault (DV) provides a robust foundation for building and managing enterprise data warehouses, especially in scenarios where data sources are numerous, diverse, and subject to change.

The benefits of using this modeling technique include:

  • Flexibility: Data vaults are based on agile methodologies and techniques, so they are designed to handle changes and additions to data sources and business requirements with minimal disruption. This makes them well-suited to environments with evolving data requirements, such as added or deleted columns, new tables, or new or altered relationships.
  • Scalability: Data vaults can accommodate large volumes of data (up to petabyte volumes) and support the integration of data from a wide range of sources. As such, this model is a good fit for organizations implementing a data lake or data lakehouse.
  • Auditability: They maintain a complete history of the data, making it easier to track changes over time and to meet HIPAA and other compliance and auditing requirements.
  • Ease of Maintenance: They simplify the process of incorporating new data sources or modifying existing ones, reducing the time and effort required for maintenance. For example, there is less need for extensive refactoring of ETL jobs when the model changes. This approach also simplifies data ingestion and removes the cleansing requirement of a star schema.
  • Parallelization: Data can be loaded in parallel, allowing for efficient processing of large datasets.
  • Ease of Setup: They have a familiar architecture, employing data layers, ETL, and star schemas, so your teams can establish this approach without extensive training.
