Hey, can you pull this data for me?

Sure, let me just ...

Gold data assets

Gold data assets are high-quality, reliable, and well-structured datasets that have been thoroughly validated and curated. Having access to such data assets is essential, because they form the foundation for data-driven decisions, predictive analytics, and efficient processes.

However, building gold data assets takes quite some time and involves a systematic process of

  • gathering data sources and documentation
  • defining the data's shape and any limitations
  • cleaning, munging, and wrangling the data into a usable form
  • (...)

The medallion architecture is a common data design pattern for logically organizing data in a lakehouse. As data passes through each layer of the architecture, its structure, quality, and maturity continuously improve.

Medallion architecture

The bronze layer serves as the initial landing point for data from external source systems, preserving the source system table structures "as-is" while incorporating metadata columns.
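
To make the "as-is plus metadata" idea concrete, here is a minimal sketch, assuming the source rows arrive as plain dictionaries. The function `land_in_bronze` and the column names `_source_system` and `_ingested_at` are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timezone


def land_in_bronze(rows, source_system, ingested_at=None):
    """Return copies of the source rows with metadata columns appended.

    The source fields stay exactly as delivered; only the (hypothetical)
    columns _source_system and _ingested_at are added.
    """
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return [
        {
            **row,
            "_source_system": source_system,
            "_ingested_at": ingested_at.isoformat(),
        }
        for row in rows
    ]


# Example source record: untrimmed strings and string-typed IDs are kept as-is.
source_rows = [
    {"CUST_ID": "0042", "NAME": " Jane Doe ", "CREATED": "2024-09-01"},
]
bronze_rows = land_in_bronze(source_rows, source_system="crm")
```

Note that nothing is trimmed, cast, or renamed at this point; any such change belongs to a later layer.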

Within the silver layer of the lakehouse, data from the bronze layer undergoes matching, merging, conforming, and "just-enough" cleansing to enable an enterprise view of master data, such as customers and transactions.

Data in the gold layer of the lakehouse is typically modeled to be consumption-ready, e.g. by introducing

  • read-optimized data models with fewer joins,
  • Kimball star schema or
  • Inmon data marts.

Introducing another layer ...

In past projects, I introduced another layer to the medallion architecture to separate the implementation of hard and soft rules. In essence, I renamed the original bronze layer to "landing zone", and then applied hard rules in the new bronze layer and soft rules in the silver layer.

Slightly modified medallion architecture

Here's the reason:

Hard rules do not alter the contents or the granularity of the data, thus maintaining auditability. Furthermore, applying these rules rarely involves business departments, so data engineers can focus on core data engineering tasks, such as

  • common data model, e.g. data vault
  • data typing
  • de-duplication
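
As a rough sketch of two of these hard rules (data typing and de-duplication), assuming landed records are dictionaries of strings: the function `apply_hard_rules` and the columns `order_id`, `order_date`, and `amount` are made up for this example. The values are only cast to proper types and exact duplicates are dropped; no business logic is applied.

```python
from datetime import date
from decimal import Decimal


def apply_hard_rules(rows):
    """Cast fields to proper types and drop exact duplicate records."""
    seen = set()
    typed_rows = []
    for row in rows:
        # Data typing: same values, but with proper types instead of strings.
        typed = {
            "order_id": int(row["order_id"]),
            "order_date": date.fromisoformat(row["order_date"]),
            "amount": Decimal(row["amount"]),
        }
        # De-duplication: skip records identical to one already kept.
        key = tuple(typed.items())
        if key in seen:
            continue
        seen.add(key)
        typed_rows.append(typed)
    return typed_rows


landed = [
    {"order_id": "1", "order_date": "2024-09-01", "amount": "19.90"},
    {"order_id": "1", "order_date": "2024-09-01", "amount": "19.90"},
    {"order_id": "2", "order_date": "2024-09-02", "amount": "5.00"},
]
bronze = apply_hard_rules(landed)
```

Here the three landed records become two typed records, because the first two are exact copies of each other.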

Soft rules change the data, e.g. by introducing business logic. Because applying these rules requires input from business analysts or subject-matter experts, collecting all requirements can be time-consuming and depends on asking the right questions. For example, you might want to

  • establish a common lexicon and rename columns to match their business meaning
  • model business processes across multiple data assets
  • enhance the data model by calculating KPIs
  • (...)
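
A small sketch of two such soft rules, renaming columns via a lexicon and calculating a KPI, could look like this. The `LEXICON` mapping, the technical column names, and the `days_to_ship` KPI are hypothetical; in a real project they would come out of the discussions with the business side.

```python
from datetime import date

# Hypothetical lexicon mapping technical column names to business terms.
LEXICON = {
    "ord_dt": "order_date",
    "shp_dt": "shipping_date",
    "cust_no": "customer_number",
}


def apply_soft_rules(row):
    """Rename columns to business terms and add a derived KPI."""
    renamed = {LEXICON.get(name, name): value for name, value in row.items()}
    # Hypothetical KPI: number of days between order and shipment.
    renamed["days_to_ship"] = (
        renamed["shipping_date"] - renamed["order_date"]
    ).days
    return renamed


bronze_row = {"ord_dt": date(2024, 9, 1), "shp_dt": date(2024, 9, 4), "cust_no": 42}
silver_row = apply_soft_rules(bronze_row)
```

Unlike the hard rules, this step changes what the data says: column names follow the business vocabulary, and a value appears that did not exist in the source.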

Here, I try to model business processes at the lowest granularity possible, which often requires collaboration across departments. Aggregation, filtering, etc. are left to the gold layer, where they depend on the requirements of the business department or the visualization technology, e.g. optimizing a table for process mining.
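
To illustrate the split, here is a minimal sketch of a gold-layer aggregation built on order-level silver rows. The function `revenue_per_month` and the column names are assumptions for this example; another consumer could build a completely different aggregate from the same silver rows.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def revenue_per_month(silver_rows):
    """Aggregate order-level rows to one revenue total per (year, month)."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in silver_rows:
        key = (row["order_date"].year, row["order_date"].month)
        totals[key] += row["amount"]
    return dict(totals)


# Silver keeps one row per order, i.e. the lowest granularity.
silver = [
    {"order_date": date(2024, 8, 30), "amount": Decimal("10.00")},
    {"order_date": date(2024, 9, 1), "amount": Decimal("19.90")},
    {"order_date": date(2024, 9, 2), "amount": Decimal("5.00")},
]
gold = revenue_per_month(silver)
```

Because the silver layer stays at order level, changing the gold table later (say, to weekly totals) does not require touching the silver model.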

Over time, these assets will become valuable resources for informed decision-making and business growth.