Data Modeling to Enable Shift Left: Part I

Summary

This post discusses how collaborative, multi-data-source data modeling tools can empower data engineers and developers to move the transformations and analytics in their data pipelines upstream towards the data sources. This best practice is known as Shift Left.

In Part I, we focus on the history of data modeling in the context of current data engineering best practices. We cover high level issues and general strategy for data modeling in Shift Left. Part II of this blog post will dive into the details of how a modern data modeling tool can empower a Shift Left strategy.

How Data Engineering and Data Modeling Work Together Today

Data modeling is an important technique for data engineers and application developers. It helps them reason about business entities, their relationships, and their attributes. Through a process of successive refinement, developers turn abstract data models into physical schemas for databases, data warehouses, and streams. Data modeling's emphasis on deep interaction with users to narrow and refine requirements before coding is considered a best practice in software engineering. Done properly, it can also support analytics modeling downstream from application software design.

Prominent data engineering thought leaders including Joe Reis, Ben Rogojan, and Dave Wells emphasize data modeling and indicate that it is critical to successful data engineering. Dave Wells recently wrote a call-to-action for data engineers to rediscover modeling:

“Data modeling is a core skill of data engineering, but it is missing or inadequate in many data engineering teams. Today’s data engineers focus primarily on data pipelines—the processes to move data from one place to another. That is a natural consequence of recent focus on data lakes and data science. But the result is loss of focus on shaping and structuring data—the stuff of data modeling and database design.

Most modern data engineers are process engineers, not product engineers. But modern data management needs to have both process engineering and product engineering. The need for product engineering is apparent with the rising interest in data products and the need for product thinking. To achieve the right balance of process and product focus we need to rediscover the lost art of data modeling.

I refer to data modeling as a lost art because it has existed as a data analysis and design discipline for decades. Entity-Relationship Modeling (ERM), introduced by Dr. Peter Chen in the mid-1970s, is the foundation of most modern data modeling techniques. ERM was widely practiced from the 1980s to around 2010 when data lakes became a data management priority. Attention to modeling diminished as data pipelines and schema-on-read became mainstream practices. It is time to rediscover the lost art of data modeling, starting with the foundational concepts of ER Modeling.”

For those who need an introduction to modeling, we provide three classic references at the end of this post.

A Short History of Data Modeling

Entity-Relationship Models for Relational Databases

Relational, normalized databases and associated modeling techniques like entity-relationship (ER) data models became popular and widespread around 1980 after relational databases were introduced. For the next two decades, relational databases dominated application development, especially for business applications.

Even though Entity-Relationship modeling was considered a best practice for relational application development, the models often weren’t co-resident with the code or updated regularly as the schema changed, as CI/CD and agile methodology would dictate. The schemas were formalized and recorded in the SQL DDL (Data Definition Language) statements, but not automatically, so the associated ER models would often diverge as the DDL and underlying physical schemas evolved.

Agile and NoSQL Reduced Interest in Data Modeling

The agile programming movement (circa 2000) and the popularity of NoSQL and document databases (DynamoDB, MongoDB, and others, starting around 2010) in the cloud era created demand for alternatives to normalized, relational models. Relational databases required up-front schema design, which seemed antithetical to agile (it isn’t), and relational schemas were viewed as difficult to evolve. In addition, programmers loved document and NoSQL databases because their schemaless designs required neither up-front schema design nor complex SQL queries, especially joins, which were considered hard to write and slow to execute.

Document and NoSQL database data types map directly to the programming objects used by developers. This obviates the need for complex, performance-limiting ORM (object-relational mapping) software that applications otherwise require to map data in relational tables into common, nested programming data structures. Non-relational data types and models are considered easier to change, and hence, more agile. As the cloud era took off, more data types (including JSON, logs, events, graphs, and AI vectors) were supported by databases. These data types are now part of the enterprise and cloud native data landscape, putting even more demands on data modeling.
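
To make the mapping concrete, here is a minimal Python sketch (with hypothetical field names) of the same order held as a nested document versus the flattened relational rows an ORM would have to join back together.

```python
# Minimal sketch (hypothetical data): the same "order" as a nested document
# vs. the normalized relational rows an ORM would have to reassemble.

# Document form: matches the application's in-memory object directly.
order_document = {
    "order_id": 1001,
    "customer": {"id": 42, "name": "Acme Corp"},
    "line_items": [
        {"sku": "A-100", "qty": 2, "price": 19.99},
        {"sku": "B-200", "qty": 1, "price": 5.49},
    ],
}

# Relational form: the same data split across three tables; an ORM (or
# hand-written joins) must stitch it back into the nested shape above.
orders = [{"order_id": 1001, "customer_id": 42}]
customers = [{"id": 42, "name": "Acme Corp"}]
line_items = [
    {"order_id": 1001, "sku": "A-100", "qty": 2, "price": 19.99},
    {"order_id": 1001, "sku": "B-200", "qty": 1, "price": 5.49},
]
```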

The Data Mess

Meanwhile, the proliferation of data sources and types contributed to a data environment, in both enterprises and cloud native companies, that data engineers could only characterize as a “mess”. Individual data sources lacked schemas, metadata, and data models, so data contracts built over these artifacts seemed like a distant dream. Schemas that span multiple data sources are even more challenging. Any modern tool claiming it can build data models in complex enterprise environments must deal with all these factors.

Source: Chad Sanderson, Gable

Why Data Modeling is Making a Comeback

The proliferation of new data stores and data types, plus the complexity of today’s sprawling data infrastructures, has renewed interest in data models as part of the solution to the data mess. But data modeling tools need to be expanded to handle more varied data types, especially the nested, hierarchical data used by the popular JSON data interchange format.

For data warehousing workloads, data modeling never went away. It is heavily used to build schemas that support popular analytical approaches like Kimball and Inmon. In the Medallion architecture, data is ETL’ed from sources in raw form via pipelines to a data warehouse, where it is transformed in several phases into usable form, often Kimball or Inmon schemas. Most of the model development was performed within the target data warehouse.

Cloud data warehouses reinvigorated the data warehouse paradigm by making them easier to deploy, manage, and scale while supporting a larger variety of data types. They worked with SQL, and tools like dbt abstracted away data pipeline infrastructure details so that anyone who knew SQL could build pipelines across nearly any database or data warehouse.

Beyond just data warehousing and the business analytics it supports, proper data modeling helps prevent poorly modeled schemas and tables in applications. Such schemas can lead to inefficient queries that perform poorly and are more expensive to run. Poorly modeled schemas (especially denormalized schemas) can also result in data inconsistency and duplication that is hard to detect and track, and they yield more complex ETL pipelines. If no schemas are provided, many optimizations cannot be performed, parsing the data takes longer, and enforcing quality constraints is nearly impossible.
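
As a small illustration of that last point, here is a hedged Python sketch, assuming the widely used jsonschema package, showing how an explicit schema lets a pipeline enforce quality constraints at ingest rather than discovering bad records downstream. The schema and field names are illustrative, not from any particular system.

```python
# Minimal sketch: with an explicit schema, a pipeline can reject bad records
# at ingest instead of propagating them downstream. Requires `jsonschema`.
from jsonschema import validate, ValidationError

ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "customer_id", "amount"],
    "properties": {
        "order_id": {"type": "integer"},
        "customer_id": {"type": "integer"},
        "amount": {"type": "number", "minimum": 0},
    },
}

def check_record(record: dict) -> bool:
    """Return True if the record satisfies the schema, False otherwise."""
    try:
        validate(instance=record, schema=ORDER_SCHEMA)
        return True
    except ValidationError:
        return False

print(check_record({"order_id": 1, "customer_id": 42, "amount": 19.99}))  # True
print(check_record({"order_id": 1, "amount": -5}))                        # False
```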

We’ve shown that data modeling is used for building analytical models and schemas like Kimball and Inmon. Today, in the pipelines built using the popular Medallion architecture, that work is shifted downstream from the data sources. In the next section, we point out some weaknesses in this approach. After that, we discuss an alternative approach called Shift Left and how data modeling can be applied within its framework.

Data Engineering Workflows Today: Medallion Architecture

Data engineering today is shifted right, focusing on the data warehouse and not on the actual data sources. Since its inception, data warehousing has emphasized ETL operations from sources to a centralized repository: a data lake, lakehouse, or classic data warehouse. The medallion architecture uses separate stages to transform data from raw (bronze) to structured (silver) and then business-ready (gold) form. Each stage typically requires storage, network, and compute resources to support data copying between stages and transformation operations within stages. Sometimes the stages are executed within the same data warehouse.
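
The sketch below is a toy Python rendering of that flow, with in-memory lists standing in for bronze, silver, and gold tables; the record shapes and cleanup rules are hypothetical, but it shows where the copying and transformation work accumulates.

```python
import json

# Raw source extract (hypothetical): newline-delimited JSON, one bad record.
raw_events = [
    '{"order_id": "1001", "amount": "19.99"}',
    'not-json',
    '{"order_id": "1002", "amount": "5.49"}',
]

# Bronze: land the raw data as-is (a full copy of the source extract).
bronze = list(raw_events)

# Silver: parse, cast types, and drop malformed records -- the defensive
# cleanup that falls on the consumer side in a shifted right workflow.
silver = []
for line in bronze:
    try:
        rec = json.loads(line)
        silver.append({"order_id": int(rec["order_id"]), "amount": float(rec["amount"])})
    except (ValueError, KeyError):
        continue

# Gold: aggregate into a business-ready shape for analytics.
gold = {"order_count": len(silver), "total_amount": sum(r["amount"] for r in silver)}
print(gold)
```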


A Critique of the Medallion Architecture

Adam Bellemare has described the data engineering problems that result from the medallion architecture.

  • The consumer bears all the responsibility for keeping the data available and running, without having ownership of or even influence on the source model. The ELT is coupled to the source model, making it brittle while consuming source database resources.
  • Medallion is expensive. Populating a bronze layer requires lots of copying and processing power. Data consumers have incentives to create their own pipelines. Finding similar datasets for reuse can be difficult because the ad hoc approach to accessing data often leads to silos and fragmented discovery. What’s needed is a way for those building data pipelines and consuming them to communicate with operational system owners.

“The medallion architecture is expensive because it is defensive as the responsibilities for making data accessible lay with the wrong people. Instead of relying on the consumer to craft clever pipelines and react quickly to inevitable breakages, involve those who actually created the data–the operational system owners.”

  • Data quality can be difficult to achieve.

“So you ETL/ELT the data out of your system into your data lake. Now you need to denormalize it, restructure it, standardize it, remodel it, and make sense out of it–without making any mistakes. There is no substitution for getting the correct high-fidelity data from the source, as tack-sharp and optimally constructed as possible. And if you do it correctly, you’ll be able to reuse it for all your use cases–operational, analytical, and everything in between.”

  • Fragility because of the consumer’s indirect role in accessing the data.
  • There is no reusability for operational workloads. Data pushed to an analytics endpoint largely remains only in the analytical space, as does any cleanup, standardization, schematization, and transformation work. It is typically processed by periodic batch jobs, which prove to be too slow to handle operational use cases.

Data modeling in these shifted right workflows involves building schemas after the fact from the raw data obtained from disparate sources. There is often a lack of understanding of the data: what the primary entities are, how they relate to each other, and so on. This information often must be reverse engineered by the data engineer and the downstream consumer, sometimes independently of the data producers.

An Alternative to Medallion: Shift Left

Shift Left is an alternative approach to Medallion that streams schematized, properly shaped data directly from sources like databases into shareable, reusable Kafka topics and Iceberg tables. These schematized topics and tables can be joined, transformed, and reused as necessary, with varying degrees of normalization, either directly with stream processing or via SQL, after materializing the topics into shared, headless storage.
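
As a rough illustration, the following Python sketch uses Confluent's confluent-kafka client to produce Avro-schematized records to a topic via Schema Registry; the topic name, schema, and endpoints are placeholders, not a prescribed setup.

```python
# A minimal sketch, assuming Confluent's confluent-kafka Python client and a
# reachable Schema Registry; broker/registry URLs, topic, and schema are
# illustrative only.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "customer_id", "type": "long"},
    {"name": "amount", "type": "double"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(schema_registry, ORDER_SCHEMA),
})

# Each message is serialized against the registered schema before it is
# written, so the topic carries well-shaped, contract-conforming records.
producer.produce(topic="orders", key="1001",
                 value={"order_id": 1001, "customer_id": 42, "amount": 19.99})
producer.flush()
```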

Shift Left treats topics as data products produced at the data source, transformed and shaped by stream processing, and managed by platforms like Confluent Stream Governance. In contrast to the producer-consumer disconnect in the medallion workflow, data products have schemas, metadata, and constraints that establish precise data contracts between producers and consumers.

Data Modeling for Shift Left

Confluent’s Shift Left strategy encourages data engineers to reconsider their traditional Medallion workflows.

Shift Left: Building on a Reliable and Quality-Focused Foundation

Shift Left solves the key problems with shifted right Medallion workflows. Recall the top five issues:

  • Consumer-producer disconnect
  • Expensive
  • Data quality
  • Fragility
  • Reusability

Let’s review each one and the solution Shift Left provides.

Consumer-Producer Disconnect

Consumers collaborate with producers to formalize data contracts around dynamic data products that producers can create, maintain, and support, and that consumers can fully leverage.

Expensive

Excessive compute and copying, plus constant break-fix work, are replaced by well-structured schemas and high-quality data built right in the first pass.

Data Quality

Ensure data quality at the source, where the data is best understood, in context, before it propagates downstream to the consumer.

Fragility

Data contracts, plus producer support for robust data pipelines and their evolution, provide resilience.

Reusability

Shift Left allows topics and tables to be reused upstream by different data pipelines and applications, instead of being trapped in downstream analytics silos.

With the Shift Left approach, data engineers and consumers need tools that help them communicate what they need to producers so the latter can craft useful data products. Shifted right medallion workflows perform laborious, expensive, and complex transformations that are made easier by tools like dbt and platforms like Snowflake. They follow common patterns to create denormalized star, snowflake, and other dimensional schemas that simplify and speed up analytics. It’s not optimal, but it works, and shifted right vendors provide the tools to support these common patterns.

Shifting Left into a headless data architecture

Data Contracts and Data Modeling

Because Shift Left encourages producers and consumers to work together to form contracts, the question arises: how can they jointly determine what to put in the data contract? How does Shift Left replace the well known shifted right data pipelines? Can we build tools that allow collaboration between data producers and consumers within a governance framework a data engineer can deploy confidently at scale?

Let’s step back and consider the ideal solution.

Facilitate, Document, and Formalize the Producer-Consumer Collaboration

To encourage topic reuse, producers should have a scalable way of communicating to one or more consumers what they can provide, while consumers should be able to communicate what they need. This should happen in a graphical, easy-to-use interface that allows the data contract to be formalized with schemas, metadata, and constraints that are co-located with code in a modern CI/CD-oriented repository like GitHub and in Confluent Schema Registry. The data contracts and their associated metadata must be updated as part of the CI/CD process, and their enforcement should be automated by DevOps. Confluent’s Data Portal governance tool provides a framework for managing this information once it’s created.
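
One way to picture such a contract living next to the code is the hedged Python sketch below: a contract document (a stand-in for whatever schema-plus-constraints file the producer and consumer agree on) and a CI-style check that fails the build when the producer's fields drift from it. All names and structures here are hypothetical.

```python
# Sketch of a CI-style compatibility check; the contract structure and field
# names are hypothetical stand-ins for the artifacts stored in the repository.

DATA_CONTRACT = {
    "subject": "orders-value",
    "required_fields": {"order_id": "long", "customer_id": "long", "amount": "double"},
    "constraints": {"amount": "must be >= 0"},
}

def contract_check(producer_fields: dict) -> list[str]:
    """Return a list of violations: required fields that are missing or retyped."""
    violations = []
    for name, ftype in DATA_CONTRACT["required_fields"].items():
        if name not in producer_fields:
            violations.append(f"missing field: {name}")
        elif producer_fields[name] != ftype:
            violations.append(f"type change on {name}: {producer_fields[name]} != {ftype}")
    return violations

# Run as part of the producer's CI pipeline; a non-empty result fails the build.
print(contract_check({"order_id": "long", "amount": "double"}))
# ['missing field: customer_id']
```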

Support Modern Data Modeling Techniques and Requirements

Classic Entity-Relationship (ER) modeling is still a powerful way to agree on business entities, their specifications, and how these entities are related. The tool should also support Domain-Driven Data Modeling (DDDM). Precision, clarity, and shared language are the allies of good data contracts. Such entities are also a powerful enabler of LLM co-pilots and AI assistants because they provide a shared vocabulary for effective prompt engineering. Forward engineering from such a model is optimal, but reverse engineering from existing implementations is both necessary and likely to be the norm; both should be supported. Building on ER modeling, the tool should provide polyglot data modeling so that non-relational data types (like JSON and graphs) across separate data sources are supported.
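
To illustrate forward engineering and polyglot output from a single logical model, here is a small Python sketch that derives both relational DDL and a JSON-style schema from one hypothetical entity definition; a real modeling tool would of course do far more.

```python
# Sketch of forward engineering one logical entity into two physical targets
# (relational DDL and a JSON-style schema); entity and type mappings are
# illustrative, not any particular tool's output.

ENTITY = {
    "name": "Customer",
    "attributes": {"id": "integer", "name": "string", "email": "string"},
    "primary_key": "id",
}

SQL_TYPES = {"integer": "BIGINT", "string": "VARCHAR(255)"}

def to_ddl(entity: dict) -> str:
    cols = [f"{a} {SQL_TYPES[t]}" for a, t in entity["attributes"].items()]
    cols.append(f"PRIMARY KEY ({entity['primary_key']})")
    return f"CREATE TABLE {entity['name']} (\n  " + ",\n  ".join(cols) + "\n);"

def to_json_schema(entity: dict) -> dict:
    return {
        "title": entity["name"],
        "type": "object",
        "required": list(entity["attributes"]),
        "properties": {a: {"type": t} for a, t in entity["attributes"].items()},
    }

print(to_ddl(ENTITY))
print(to_json_schema(ENTITY))
```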

Integration of the Data Modeling Tool with Confluent’s Governance Product Suite

Assuming the existence of such a tool, how does this impact the Shift Left landscape? Instead of a reactive, mechanical “make this table available via CDC” approach, the tool creates a framework for collaborating on reusable data products that are governed, flexible, and agile. It should integrate with Confluent’s governance product portfolio, making it easy for customers to automatically configure (for example) Data Portal with the data models the tool generated. It should also detect topic and message schemas automatically and offer to register them in a schema registry.
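
For example, a hedged sketch of that last step, assuming the confluent-kafka Python client and a reachable Schema Registry; the subject name and Avro schema are illustrative of how a modeling tool could register the schemas it detects or generates.

```python
# Minimal sketch: register a detected/generated schema under a subject.
# Registry URL, subject, and schema are placeholders.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

AVRO_SCHEMA = """
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

client = SchemaRegistryClient({"url": "http://localhost:8081"})
schema_id = client.register_schema("customers-value", Schema(AVRO_SCHEMA, "AVRO"))
print(f"Registered schema id: {schema_id}")
```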

Conclusion for Part I

In Part I of this two-part blog, we traced the history of data modeling. We made the case that if evolved properly, data modeling would help data engineers, application developers, data producers and consumers to build resilient, flexible, and reusable shifted-left data pipelines. In Part II, we will explore one such tool and show how it enables Shift Left.

References

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition, by Ralph Kimball and Margy Ross. Published by Wiley.

Oracle Database 23ai Data Modeling and Schema Design for JSON-Relational Duality, by Beda Hammerschmidt, Pascal Desmarets, and Steve Hoberman. Published by Technics Publications.

Building Event-Driven Microservices, by Adam Bellemare. Published by O’Reilly Media.

Pascal Desmarets

Founder/CEO at Hackolade

Excellent article Matthew O'Keefe, Ph.D. Looking forward to reading part 2. Thanks for the kind words. Just trying to restore a balance between over- and under-designing data structures. This balance is particularly important to meet the challenges of event design, where Shift Left provides so many advantages, as you describe.
