Data Modeling to Enable Shift Left: Part II


Summary

This post discusses how collaborative, multi-data-source modeling tools can empower data engineers and developers to move the transformations and analytics in their data pipelines upstream towards the data sources. This best practice is known as Shift Left.

In Part I, we focused on the history of data modeling in the context of current data engineering best practices. We covered high-level issues and general strategy for data modeling in Shift Left.

Here in Part II, we will dive into the details of how a modern data modeling tool can empower a Shift Left strategy. The goal is to help data engineers, application developers, and data producers and consumers build resilient, flexible, and reusable shifted-left data pipelines.

What’s Required

We emphasize three requirements for shift left data modeling.

  • Support Modern Data Modeling Techniques and Requirements
  • Enable Producer-Consumer Collaboration
  • Integration with Confluent’s Schema Registry

We found one data modeling tool that provides most, and potentially all, of what we are looking for when shifting left. It is developed by Hackolade, a pioneer in polyglot data modeling. Polyglot modeling expands beyond the original three-phase conceptual-logical-physical modeling used for relational, normalized models. Instead, it supports a combined physical and logical model from which each of the various model types can be derived. It can be designed from scratch or seeded with an existing model.

Support for Modern Data Modeling Techniques and Requirements

Data modeling has evolved significantly from its strictly relational, Entity-Relationship model past. Modern Domain-Driven designs using polyglot data modeling can support varied data types and models, from traditional relational to document/hierarchical (mostly JSON), graph, key-value, and columnar. This is critical because data for applications comes in many forms, and from many different types of sources.

Classic ER modeling workflow for relational models.

Polyglot data models can combine the underlying data types and models, from a wide variety of databases, APIs, query, graph and document engines, into a single model for an application. This simplifies data model extraction, design, and evolution from operational and analytic sources because the process happens underneath the umbrella of a single modeling tool.

An integrated tool allows the developer, acting as the data product consumer, to develop data contracts with each data source’s owner, known as the producer. The data contracts are commonly expressed as REST APIs or schemas. Because applications need fresh operational data delivered in real time, it is critical that these upstream, shifted-left data sources be modeled together, in a collaborative way.

Polyglot Data Modeling.

The polyglot data model is a "sort of a logical model in the sense that it is technology-agnostic, but with additional features:

  • it allows denormalization, if desired, given query-driven access patterns;
  • it allows complex data types;
  • it generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies;
  • it combines in a single model both a conceptual layer in the form of a graph diagram view to be friendly to business users, as well as an ER Diagram for the logical layer."
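To make the schema-generation point above concrete, here is a minimal sketch, in Python, of projecting one technology-agnostic entity onto two physical targets. The entity, attribute names, and type-mapping tables are hypothetical illustrations, not Hackolade's internal representation.

```python
# Sketch: one technology-agnostic (polyglot) entity projected onto two targets.
# The entity definition and mapping tables below are hypothetical.

CUSTOMER = {
    "name": "customer",
    "attributes": [
        {"name": "customer_id", "type": "uuid"},
        {"name": "full_name",   "type": "string"},
        {"name": "created_at",  "type": "timestamp"},
    ],
}

# Per-target type mappings (illustrative only).
SQL_TYPES = {"uuid": "UUID", "string": "VARCHAR(255)", "timestamp": "TIMESTAMP"}
AVRO_TYPES = {
    "uuid": {"type": "string", "logicalType": "uuid"},
    "string": "string",
    "timestamp": {"type": "long", "logicalType": "timestamp-millis"},
}

def to_sql_ddl(entity):
    # Relational target: emit a CREATE TABLE statement.
    cols = ",\n  ".join(f'{a["name"]} {SQL_TYPES[a["type"]]}' for a in entity["attributes"])
    return f'CREATE TABLE {entity["name"]} (\n  {cols}\n);'

def to_avro_schema(entity):
    # Streaming target: emit an Avro record definition.
    return {
        "type": "record",
        "name": entity["name"],
        "fields": [{"name": a["name"], "type": AVRO_TYPES[a["type"]]} for a in entity["attributes"]],
    }

if __name__ == "__main__":
    print(to_sql_ddl(CUSTOMER))       # relational physical model
    print(to_avro_schema(CUSTOMER))   # Kafka/Avro physical model
```

Running the script prints a CREATE TABLE statement for the relational target and an Avro record definition for the streaming target, both derived from the same logical entity.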

Polyglot Data Modeling and Applications

Modern applications increasingly use multiple types of data models, including documents, graphs, and traditional relational, normalized tables. Data types like geospatial, vectors (for AI embeddings), logs, streams, and JSON continue to gain in popularity.

These data models and types are also being integrated in new ways. For example, structured relational and document databases can support AI vectors that establish mappings to unstructured data like LLM embeddings or ML-trained hyperdimensional models.

The vector represents an attribute related to, and possibly derived from, the other attributes in the row or document. It can be embedded in the table or document or retrieved from a vector store. Classic data modeling can handle vectors as just another form of attribute, using either Entity-Relationship or Domain-Driven design methods.
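As a brief, hypothetical sketch (field names and dimensions invented for illustration), a vector embedding really is just another attribute alongside the rest of the row or document:

```python
# Sketch: a vector embedding treated as "just another attribute".
# Document shape, field names, and dimension are hypothetical.

product_document = {
    "product_id": "sku-1234",
    "title": "trail running shoe",
    "price": 129.99,
    # Embedding derived from the title/description by an embedding model; it can
    # live in the same document/row or be retrieved from a separate vector store.
    "title_embedding": [0.012, -0.094, 0.233, 0.051],  # 4 dimensions for illustration
}

# The same attribute expressed in an Avro schema is simply an array of floats.
embedding_field = {"name": "title_embedding",
                   "type": {"type": "array", "items": "float"}}
```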

Applications, especially AI applications, need real-time, contextualized data from a variety of sources. Data streaming helps solve both problems, delivering fresh data along with the context assembled from those sources. Building applications from these sources in a way that meets user and operational requirements requires data models that are precise, minimal, and visual. Polyglot data modeling combined with a collaborative, visual design tool can meet these requirements.

Ultimately, data modeling is about building applications and analytics right the first time, by getting the data models and user requirements defined precisely. This mentality supports our primary goal here: to make building applications, especially AI applications, easier and more predictable.

Practical Considerations for Designing Applications

Let’s consider retrofitting existing applications and data first. Data models can be reverse-engineered from DDL (Data Definition Language) statements like SQL CREATE TABLE for relational databases, and from sampling documents in document databases like MongoDB.
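A toy sketch of the document-sampling approach, with hypothetical sample documents and deliberately naive type rules, gives the flavor of how a schema can be inferred rather than hand-written:

```python
# Sketch: infer a rough schema by sampling documents, in the spirit of
# reverse-engineering a model from a document database. Sample documents and
# type rules are hypothetical; real tools sample far more documents and handle
# nesting, arrays, and conflicting types.

SAMPLES = [
    {"order_id": 1, "customer": "alice", "total": 42.50, "rush": True},
    {"order_id": 2, "customer": "bob",   "total": 10.00},
]

def infer_schema(docs):
    observed = {}
    for doc in docs:
        for field, value in doc.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    # Fields missing from some documents are effectively optional.
    required = {f for f in observed if all(f in d for d in docs)}
    return {f: {"types": sorted(t), "required": f in required}
            for f, t in observed.items()}

print(infer_schema(SAMPLES))
# e.g. {'order_id': {'types': ['int'], 'required': True}, ...,
#       'rush': {'types': ['bool'], 'required': False}}
```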

Think of each data source as its own microservice. Event-driven architecture provides a framework for designing applications around microservices and events. The data modeling challenge then becomes developing data contracts for these events to provide the application’s event streaming foundation. The same process is required for traditional, and still very popular, monolithic applications.
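In practice, such an event data contract often boils down to a versioned schema that producer and consumer agree on. A minimal, entirely hypothetical Avro example for an "order placed" event, expressed here as a Python dict, might look like this:

```python
# Hypothetical data contract for an "order placed" event, expressed as an Avro
# schema. Field names, namespace, and defaults are illustrative only; the real
# contract is whatever producer and consumer agree on and register.
ORDER_PLACED_V1 = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "com.example.orders",  # hypothetical namespace
    "fields": [
        {"name": "order_id",    "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "total_cents", "type": "long"},
        # Optional field with a default keeps the contract backward compatible
        # for older producers that omit it.
        {"name": "coupon_code", "type": ["null", "string"], "default": None},
    ],
}
```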

Denormalized, hierarchical models have become popular, and polyglot modeling supports not only hierarchical models but also graph, columnar, key-value, and other models. Beyond supporting these models individually, the Hackolade data modeling tool can translate complex, denormalized data types into different syntaxes on the physical side.

For shift left data streaming, this is very powerful, as it allows us to translate different data sources (e.g., relational, graph, key-value, document, etc.) directly into Avro (among others), a popular binary format for Kafka.
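As a small sketch of that last step, the fastavro library can encode a denormalized, hierarchical record straight into Avro's binary format; the order/line-item schema below is hypothetical:

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Hypothetical denormalized document: an order with embedded line items,
# the kind of hierarchical shape a document database would hold.
schema = parse_schema({
    "type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "items", "type": {"type": "array", "items": {
            "type": "record", "name": "LineItem", "fields": [
                {"name": "sku", "type": "string"},
                {"name": "qty", "type": "int"},
            ]}}},
    ]})

order = {"order_id": "o-42", "items": [{"sku": "sku-1", "qty": 2}]}

buf = io.BytesIO()
schemaless_writer(buf, schema, order)            # encode to Avro binary for Kafka
buf.seek(0)
assert schemaless_reader(buf, schema) == order   # round-trips intact
```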

Hackolade Data Modeling Tool

Polyglot modeling for data sources can be forward- or reverse-engineered. Hackolade provides a feature-rich GUI for forward-engineering based on data modeling best practices. The environment supports a spectrum of stakeholders including business owners, business users, data architects, developers, and customers, among others. By adding data consumers to the process, application developers and data architects (acting jointly as the data producer) can work with consumers to shape the data model and related data contract metadata within the Hackolade framework for a wide variety of data sources.


Hackolade Studio GUI: Top Level.

The GUI is separated into an Object Browser on the left, a central pane for entities, and a Properties Pane on the right. The Object Browser lets you search for specific entities. Once selected, the central pane lets you view and edit an Object using either an ER diagram or a hierarchical tree view. Zooming in, the figure above shows an ER diagram for an order entity that has relationships with the customer and product entities.

order Entity with relationships with customer and product entities.

We can also forward engineer an object. Here is a simple object we created with one entity called “New_record” and three fields.

A simple entity "New_record" with three fields.

We can add an additional entity and create a relationship with the existing entity.

Add a new entity and a relationship with the existing entity.

We can automatically generate an Avro schema for our simple two entity model.

Avro schema for the two entity model.

We can edit the namespace called “New_Namespace” in the Properties Pane, including adding a name and associated metadata for this namespace. Any property selected in the Object Browser or central pane can be edited in the Properties Pane on the right.

Metadata can be added in the Properties Pane on the right.

Enable Producer-Consumer Collaboration

Hackolade’s tool promotes collaboration between the Data Producer and Consumer. A team can either create their data model from scratch or reverse engineer it from existing schemas or data. In either case, developers, architects and business owners can transform, design, edit and document their schemas visually as a team. They can associate metadata with these models and colocate this metadata with the models in a git repository.

Co-locating data models and their schema artifacts with application code allows the data models to follow the lifecycle of application changes. Providing a single source-of-truth for business and technical stakeholders is possible because the data model in the Git repo is the source of the technical schemas used by applications, databases, and APIs. Simultaneously, it is the source for the business-facing data dictionaries. This architecture contributes to the shared understanding of the context and meaning of data by all the stakeholders.

Hackolade provides a native integration with Git repositories for your data models so that users do not have to use Git directly. The tool enables versioning, branching, change tracking, team collaboration, peer review, and other related capabilities.

An Example Schema Evolution with Git Support in Hackolade

In the example below for an existing production database instance (in yellow), Hackolade is used to infer the schema and create version 1.0 of the data model. The model is then enriched with descriptions and constraints to generate documentation, resulting in version 1.1. In the meantime, the schema in the production database has evolved again with the delivery of new agile sprints and feedback from the business. Simultaneously, a new feature development initiative is taking place that evolves the data model toward a 2.0 beta.


A couple of successive merge operations become necessary. First, the reverse-engineering of the production database into a new model to be merged with model v1.1 into v1.2:

Followed by the merging of the resulting v1.2 with 2.0 beta into v2.0 Release Candidate:

(All figures in this subsection courtesy of Hackolade).

Integrating with Confluent’s Schema Registry

The Confluent Schema Registry is a central repository with a RESTful interface where developers define schemas for their topics, register new schema versions, and enforce compatibility rules as those schemas evolve. Data models defined in Hackolade for Avro schema and JSON Schema structures can be forward-engineered to or reverse-engineered from the Confluent Schema Registry, thereby facilitating schema management and validation of Kafka pub/sub messages. Confluent Schema Registry can interact with JSON Schema or Avro models in Hackolade directly via the GUI or CLI.

Once a schema has been created in Hackolade, it can be sent to the Confluent Schema Registry.

Forward engineering an Avro schema into Confluent Schema Registry.
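The same forward-engineering step can also be done programmatically outside the Hackolade GUI. Here is a minimal sketch using the confluent-kafka Python client; the registry URL, subject name, and schema are placeholders for your own environment:

```python
import json
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# Placeholder registry URL; point this at your Schema Registry instance.
client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Hypothetical schema to register (e.g., exported from the data model).
order_placed_v1 = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "com.example.orders",
    "fields": [{"name": "order_id", "type": "string"}],
}

# Subjects conventionally follow <topic>-value under the default TopicNameStrategy.
schema_id = client.register_schema(
    "orders.placed-value",
    Schema(json.dumps(order_placed_v1), schema_type="AVRO"),
)
print(f"registered schema id {schema_id}")
```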

Conclusions

In this post, we showed how a modern data modeling tool can empower a Shift Left strategy. Key capabilities around collaboration, polyglot data modeling, and integration with schema registry were described. Collaboration within the context of sophisticated change management using Git is particularly important as data producers and consumers work to stay in sync on the data products they share.

