Data Modeling to Enable Shift Left: Part II


Summary

This post discusses how collaborative, multi-data-source modeling tools can empower data engineers and developers to move the transformations and analytics in their data pipelines upstream towards the data sources. This best practice is known as Shift Left.

In Part I, we focused on the history of data modeling in the context of current data engineering best practices. We covered high-level issues and general strategy for data modeling in Shift Left.

Here in Part II, we will dive into the details of how a modern data modeling tool can empower a Shift Left strategy. The goal is to help data engineers, application developers, and data producers and consumers build resilient, flexible, and reusable shifted-left data pipelines.

What’s Required

We emphasize three requirements for shift left data modeling.

  • Support Modern Data Modeling Techniques and Requirements
  • Enable Producer-Consumer Collaboration
  • Integration with Confluent’s Schema Registry

We found one data modeling tool that provides most, and potentially all, of what we are looking for when shifting left. It is developed by Hackolade, a pioneer in polyglot data modeling. Polyglot modeling expands beyond the original three-phase conceptual-logical-physical modeling used for relational, normalized models. Instead, it supports a combined physical and logical model from which each of the various model types can be derived. It can be designed from scratch or seeded with an existing model.

Support for Modern Data Modeling Techniques and Requirements

Data modeling has evolved significantly from its strictly relational, Entity-Relationship model past. Modern Domain-Driven designs using polyglot data modeling can support varied data types and models, from traditional relational to document/hierarchical (mostly JSON), graph, key-value, and columnar. This is critical because data for applications comes in many forms, and from many different types of sources.

Classic ER modeling workflow for relational models.

Polyglot data models can combine the underlying data types and models, from a wide variety of databases, APIs, query, graph and document engines, into a single model for an application. This simplifies data model extraction, design, and evolution from operational and analytic sources because the process happens underneath the umbrella of a single modeling tool.

An integrated tool allows the developer, acting as the data product consumer, to develop data contracts with each data source’s owner, known as the producer. The data contracts are commonly expressed as REST APIs or schemas. Because applications need fresh operational data delivered in real time, it is critical that these upstream, shifted-left data sources be modeled together, in a collaborative way.

Polyglot Data Modeling.

The polyglot data model is a "sort of a logical model in the sense that it is technology-agnostic, but with additional features:

  • it allows denormalization, if desired, given query-driven access patterns;
  • it allows complex data types;
  • it generates schemas for a variety of technologies, with automatic mapping to the specific data types of the respective target technologies;
  • it combines in a single model both a conceptual layer in the form of a graph diagram view to be friendly to business users, as well as an ER Diagram for the logical layer."
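To make the schema-generation point above concrete, here is a minimal sketch, in Python, of projecting one technology-agnostic entity onto two physical targets. The entity, attribute names, and type-mapping tables are hypothetical illustrations, not Hackolade's internal representation.

```python
# Sketch: one technology-agnostic (polyglot) entity projected onto two targets.
# The entity definition and mapping tables below are hypothetical.

CUSTOMER = {
    "name": "customer",
    "attributes": [
        {"name": "customer_id", "type": "uuid"},
        {"name": "full_name",   "type": "string"},
        {"name": "created_at",  "type": "timestamp"},
    ],
}

# Per-target type mappings (illustrative only).
SQL_TYPES = {"uuid": "UUID", "string": "VARCHAR(255)", "timestamp": "TIMESTAMP"}
AVRO_TYPES = {
    "uuid": {"type": "string", "logicalType": "uuid"},
    "string": "string",
    "timestamp": {"type": "long", "logicalType": "timestamp-millis"},
}

def to_sql_ddl(entity):
    # Relational target: emit a CREATE TABLE statement.
    cols = ",\n  ".join(f'{a["name"]} {SQL_TYPES[a["type"]]}' for a in entity["attributes"])
    return f'CREATE TABLE {entity["name"]} (\n  {cols}\n);'

def to_avro_schema(entity):
    # Streaming target: emit an Avro record definition.
    return {
        "type": "record",
        "name": entity["name"],
        "fields": [{"name": a["name"], "type": AVRO_TYPES[a["type"]]} for a in entity["attributes"]],
    }

if __name__ == "__main__":
    print(to_sql_ddl(CUSTOMER))       # relational physical model
    print(to_avro_schema(CUSTOMER))   # Kafka/Avro physical model
```

Running the script prints a CREATE TABLE statement for the relational target and an Avro record definition for the streaming target, both derived from the same logical entity.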

Polyglot Data Modeling and Applications

Modern applications increasingly use multiple types of data models, including documents, graphs, and traditional relational, normalized tables. Data types like geospatial, vectors (for AI embeddings), logs, streams, and JSON continue to gain in popularity.

These data models and types are also being integrated in new ways. For example, structured relational and document databases can support AI vectors that establish mappings to unstructured data like LLM embeddings or ML-trained hyperdimensional models.

The vector represents an attribute related to, and possibly derived from, the other attributes in the row or document. It can be embedded in the table or document or retrieved from a vector store. Classic data modeling can handle vectors as just another form of attribute, using either Entity-Relationship or Domain-Driven design methods.
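As a brief, hypothetical sketch (field names and dimensions invented for illustration), a vector embedding really is just another attribute alongside the rest of the row or document:

```python
# Sketch: a vector embedding treated as "just another attribute".
# Document shape, field names, and dimension are hypothetical.

product_document = {
    "product_id": "sku-1234",
    "title": "trail running shoe",
    "price": 129.99,
    # Embedding derived from the title/description by an embedding model; it can
    # live in the same document/row or be retrieved from a separate vector store.
    "title_embedding": [0.012, -0.094, 0.233, 0.051],  # 4 dimensions for illustration
}

# The same attribute expressed in an Avro schema is simply an array of floats.
embedding_field = {"name": "title_embedding",
                   "type": {"type": "array", "items": "float"}}
```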

Applications, especially AI applications, need real-time, contextualized data from a variety of sources. Data streaming helps solve both problems, delivering fresh data along with the context assembled from those sources. Building applications from these sources in a way that meets user and operational requirements requires data models that are precise, minimal, and visual. Polyglot data modeling combined with a collaborative, visual design tool can meet these requirements.

Ultimately, data modeling is about building applications and analytics right the first time, by getting the data models and user requirements defined precisely. This mentality supports our primary goal here: to make building applications, especially AI applications, easier and more predictable.

Practical Considerations for Designing Applications

Let’s consider retrofitting existing applications and data first. Data models can be reverse-engineered from DDL (Data Definition Language) statements like SQL CREATE TABLE for relational databases, and from sampling documents in document databases like MongoDB.
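A toy sketch of the document-sampling approach, with hypothetical sample documents and deliberately naive type rules, gives the flavor of how a schema can be inferred rather than hand-written:

```python
# Sketch: infer a rough schema by sampling documents, in the spirit of
# reverse-engineering a model from a document database. Sample documents and
# type rules are hypothetical; real tools sample far more documents and handle
# nesting, arrays, and conflicting types.

SAMPLES = [
    {"order_id": 1, "customer": "alice", "total": 42.50, "rush": True},
    {"order_id": 2, "customer": "bob",   "total": 10.00},
]

def infer_schema(docs):
    observed = {}
    for doc in docs:
        for field, value in doc.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    # Fields missing from some documents are effectively optional.
    required = {f for f in observed if all(f in d for d in docs)}
    return {f: {"types": sorted(t), "required": f in required}
            for f, t in observed.items()}

print(infer_schema(SAMPLES))
# e.g. {'order_id': {'types': ['int'], 'required': True}, ...,
#       'rush': {'types': ['bool'], 'required': False}}
```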

Think of each data source as its own microservice. Event-driven architecture provides a framework for designing applications around microservices and events. The data modeling challenge then becomes developing data contracts for these events to provide the application’s event streaming foundation. The same process is required for traditional, and still very popular, monolithic applications.
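In practice, such an event data contract often boils down to a versioned schema that producer and consumer agree on. A minimal, entirely hypothetical Avro example for an "order placed" event, expressed here as a Python dict, might look like this:

```python
# Hypothetical data contract for an "order placed" event, expressed as an Avro
# schema. Field names, namespace, and defaults are illustrative only; the real
# contract is whatever producer and consumer agree on and register.
ORDER_PLACED_V1 = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "com.example.orders",  # hypothetical namespace
    "fields": [
        {"name": "order_id",    "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "total_cents", "type": "long"},
        # Optional field with a default keeps the contract backward compatible
        # for older producers that omit it.
        {"name": "coupon_code", "type": ["null", "string"], "default": None},
    ],
}
```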

Denormalized, hierarchical models have become popular, and polyglot modeling supports not only hierarchical models but also graph, columnar, key-value, and other models. Beyond supporting these models individually, the Hackolade data modeling tool can translate complex, denormalized data types into different syntaxes on the physical side.

For shift left data streaming, this is very powerful, as it allows us to translate different data sources (e.g., relational, graph, key-value, document, etc.) directly into Avro (among others), a popular binary format for Kafka.
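As a small sketch of that last step, the fastavro library can encode a denormalized, hierarchical record straight into Avro's binary format; the order/line-item schema below is hypothetical:

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Hypothetical denormalized document: an order with embedded line items,
# the kind of hierarchical shape a document database would hold.
schema = parse_schema({
    "type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "items", "type": {"type": "array", "items": {
            "type": "record", "name": "LineItem", "fields": [
                {"name": "sku", "type": "string"},
                {"name": "qty", "type": "int"},
            ]}}},
    ]})

order = {"order_id": "o-42", "items": [{"sku": "sku-1", "qty": 2}]}

buf = io.BytesIO()
schemaless_writer(buf, schema, order)            # encode to Avro binary for Kafka
buf.seek(0)
assert schemaless_reader(buf, schema) == order   # round-trips intact
```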

Hackolade Data Modeling Tool

Polyglot modeling for data sources can be forward- or reverse-engineered. Hackolade provides a feature-rich GUI for forward-engineering based on data modeling best practices. The environment supports a spectrum of stakeholders including business owners, business users, data architects, developers, and customers, among others. By adding data consumers to the process, application developers and data architects (acting jointly as the data producer) can work with consumers to shape the data model and related data contract metadata within the Hackolade framework for a wide variety of data sources.


Hackolade Studio GUI: Top Level.

The GUI is separated into an Object Browser on the left, a central pane for entities, and a Properties Pane on the right. The Object Browser lets you search for specific entities. Once selected, the central pane lets you view and edit an Object using either an ER diagram or a hierarchical tree view. Zooming in, the figure above shows an ER diagram for an order entity that has relationships with the customer and product entities.

order Entity with relationships with customer and product entities.

We can also forward engineer an object. Here is a simple object we created with one entity called “New_record” and three fields.

A simple entity "New_record" with three fields.

We can add an additional entity and create a relationship with the existing entity.

Add a new entity and a relationship with the existing entity.

We can automatically generate an Avro schema for our simple two entity model.

Avro schema for the two entity model.

We can edit the namespace called “New_Namespace” in the Properties Pane, including adding a name and associated metadata for this namespace. Any property selected in the Object Browser or central pane can be edited in the Properties Pane on the right.

Metadata can be added in the Properties Pane on the right.

Enable Producer-Consumer Collaboration

Hackolade’s tool promotes collaboration between the Data Producer and Consumer. A team can either create their data model from scratch or reverse engineer it from existing schemas or data. In either case, developers, architects and business owners can transform, design, edit and document their schemas visually as a team. They can associate metadata with these models and colocate this metadata with the models in a git repository.

Co-locating data models and their schema artifacts with application code allows the data models to follow the lifecycle of application changes. Providing a single source-of-truth for business and technical stakeholders is possible because the data model in the Git repo is the source of the technical schemas used by applications, databases, and APIs. Simultaneously, it is the source for the business-facing data dictionaries. This architecture contributes to the shared understanding of the context and meaning of data by all the stakeholders.

Hackolade provides a native integration with Git repositories for your data models so that users do not have to use Git directly. The tool enables versioning, branching, change tracking, team collaboration, peer review, and other related capabilities.

An Example Schema Evolution with Git Support in Hackolade

In the example below for an existing production database instance (in yellow), Hackolade is used to infer the schema and create version 1.0 of the data model. The model is then enriched with descriptions and constraints to generate documentation, resulting in version 1.1. In the meantime, the schema in the production database has evolved again with the delivery of new agile sprints and feedback from the business. Simultaneously, a new feature development initiative is taking place that evolves the data model toward a 2.0 beta.


A couple of successive merge operations become necessary. First, the reverse-engineering of the production database into a new model to be merged with model v1.1 into v1.2:

Followed by the merging of the resulting v1.2 with 2.0 beta into v2.0 Release Candidate:

(All figures in this subsection courtesy of Hackolade).

Integrating with Confluent’s Schema Registry

The Confluent Schema Registry is a central repository with a RESTful interface where developers define schemas for their topics, register new schema versions, and enforce compatibility rules as those schemas evolve. Data models defined in Hackolade for Avro schema and JSON Schema structures can be forward-engineered to or reverse-engineered from the Confluent Schema Registry, thereby facilitating schema management and validation of Kafka pub/sub messages. Confluent Schema Registry can interact with JSON Schema or Avro models in Hackolade directly via the GUI or CLI.

Once a schema has been created in Hackolade, it can be sent to the Confluent Schema Registry.

Forward engineering an Avro schema into Confluent Schema Registry.
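The same forward-engineering step can also be done programmatically outside the Hackolade GUI. Here is a minimal sketch using the confluent-kafka Python client; the registry URL, subject name, and schema are placeholders for your own environment:

```python
import json
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# Placeholder registry URL; point this at your Schema Registry instance.
client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Hypothetical schema to register (e.g., exported from the data model).
order_placed_v1 = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "com.example.orders",
    "fields": [{"name": "order_id", "type": "string"}],
}

# Subjects conventionally follow <topic>-value under the default TopicNameStrategy.
schema_id = client.register_schema(
    "orders.placed-value",
    Schema(json.dumps(order_placed_v1), schema_type="AVRO"),
)
print(f"registered schema id {schema_id}")
```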

Conclusions

In this post, we showed how a modern data modeling tool can empower a Shift Left strategy. Key capabilities around collaboration, polyglot data modeling, and integration with schema registry were described. Collaboration within the context of sophisticated change management using Git is particularly important as data producers and consumers work to stay in sync on the data products they share.

