Cloud Data Warehousing—Architectural Design Patterns

I imagine you’ve seen many IT diagrams masquerading as architectures! I use a simple rule: if there’s a product or open-source component name or logo showing, it’s not an architecture. I used to call such pictures “boxologies”! Which sounds a bit rude…

All the figures in the previous four posts are architectures. The first data warehouse diagram I drew way back in 1988 was also an architecture. However, with the emergence of the many flavors of data warehousing in the past decade, I find that “pure” architecture is now rare.

Many diagrams use the same terms to describe subtly or distinctly different things. In other cases, the same solution is given two different names. Indeed, some solutions are described as a combination of different things. For example, a vendor may suggest that their offering applies “data mesh concepts” on top of a “data fabric base” to deliver a “data lakehouse solution.” I exaggerate only a little. But what could this possibly mean?

Maybe I personally need to be a little more flexible! But I implore vendors and consultants to be more strict. I’ve started to talk about architectural design patterns (ADPs) to allow more space for variations on a theme and for concepts to develop and evolve. An ADP offers a set of terminology and, usually, a picture, on which we can agree, at least throughout an ongoing discussion. It encapsulates the key business needs and fundamental infrastructure requirements and constraints of a particular solution approach.

In “Cloud Data Warehousing—Volume I: Architecting Data Warehouse, Lakehouse, Mesh, and Fabric” (available here), I describe six architectural design patterns: three are foundational and three are emergent and evolving even as I write. The following is adapted and abridged from that book.

Data warehouse classic (DWC): provides correct and consistent, well-modeled, schema-on-write, relevant, and usable—as far as possible—information in support of business analysis and decision-making needs in a cross-business manner. A DWC may be structured as a hub-and-spoke pattern, a dimensional / star-schema pattern, or some combination of both.

On-premises and cloud versions of this ADP exist because technology can and does drive important differences at the physical implementation level. DWC/op is implemented with “on-premises” technologies based on finite servers or server clusters, using “conventional” relational database technologies. DWC/cn is built on “cloud-native” technology, including automatically elastic and scalable features, object storage, and separate compute and storage, with multi-cluster compute.

Logical data warehouse (LDW): extends a DWC with direct, real-time access to data in other sources, such as operational systems, files, NoSQL stores, etc. Access is mediated through an overarching logical data model describing the different data sources in a common language. Businesspeople access all the data through data virtualization technology.

Data lake classic (DLC): offers data in raw, as-received format, or with limited preprocessing and cleansing at the discretion of the business user. Key characteristics include scalable data storage in any format, multiple processing models, and timely, flexible usage (schema-on-read). In many cases, data governance is limited, with users left to their own devices to figure out which data to use when.
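To make the schema-on-write versus schema-on-read distinction between DWC and DLC concrete, here is a minimal Python sketch. It is an illustration only, not taken from the book: the names `SALES_SCHEMA`, `conform`, `WarehouseTable`, and `LakeStore` are hypothetical, and a real warehouse or lake involves far more than in-memory lists.

```python
import json
from dataclasses import dataclass, field

# Hypothetical schema for illustration: field name -> required Python type.
SALES_SCHEMA = {"order_id": int, "amount": float}


def conform(record, schema):
    """Return the record coerced to the schema, or raise ValueError."""
    try:
        return {name: typ(record[name]) for name, typ in schema.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not fit schema: {record!r}") from exc


@dataclass
class WarehouseTable:
    """Schema-on-write: records are checked before they are stored."""
    schema: dict
    rows: list = field(default_factory=list)

    def load(self, record):
        # Data that does not fit the model is rejected at write time.
        self.rows.append(conform(record, self.schema))

    def read(self):
        return list(self.rows)


@dataclass
class LakeStore:
    """Schema-on-read: raw text is kept as received; the reader applies a schema."""
    raw: list = field(default_factory=list)

    def land(self, line):
        # No validation at write time.
        self.raw.append(line)

    def read(self, schema):
        rows = []
        for line in self.raw:
            try:
                rows.append(conform(json.loads(line), schema))
            except ValueError:
                # The reader decides how to treat misfits; this sketch skips them.
                continue
        return rows
```

In this sketch the warehouse table refuses a record with a missing `amount` at load time, while the lake accepts any line and leaves it to each reader to choose a schema and deal with whatever does not fit. That is the trade-off in miniature: consistency up front versus flexibility, with governance left to the user.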

The three emergent ADPs are:

Data lakehouse: proposes an elastic cloud solution to a combination of DWC and DLC needs, despite their clearly conflicting nature. It offers an environment based on an object store as a single well-governed storage layer for all structured and semi-structured data, managed and accessed through “relational-like” function, with some technical metadata support. In addition, loosely structured (so-called “unstructured”) data is included, as found in DLC.

The data lakehouse ADP differs only marginally in semantics and initial focus from the DWC/cn pattern defined above.

Data fabric: essentially an extension of the LDW pattern, offers enhanced management and automation of data storage, population, access, and all aspects of data management in a diverse, distributed environment usually centered on a DWC in either of its flavors. This is supported via AI-enhanced and -extended active metadata that reflects the real, changing, live business and computing environment across the entire set of data stores and processes.

Data mesh: proposes a highly distributed, analytics-focused environment that shuns conventional approaches to centralizing data in warehouses or lakes (for flexibility and agility in development and delivery), and instead promotes domain-driven design to deliver data as a product. Such data products are realized and managed by combined business/IT teams within business domains, with a focus on embedded, distributed governance and infrastructure-as-a-platform.
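As a rough illustration of "data as a product" in the data mesh pattern, the following Python sketch models a data product owned by a business domain and a shared catalog through which other domains discover it. Everything here is an assumption made for illustration: `DataProduct`, `Catalog`, and their fields are hypothetical names, not an API from the book or from any data mesh product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DataProduct:
    """A dataset published and served by the domain that owns it."""
    name: str
    domain: str        # owning business domain
    owner: str         # accountable business/IT team
    schema: tuple      # published column names: the product's contract
    fetch: Callable    # how the domain serves the rows (hypothetical hook)

    def read(self):
        rows = self.fetch()
        for row in rows:
            missing = set(self.schema) - set(row)
            if missing:
                # Embedded governance: the product enforces its own contract.
                raise ValueError(f"{self.name}: missing columns {sorted(missing)}")
        return rows


class Catalog:
    """Shared platform service: registration and discovery only, no central data copy."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        key = (product.domain, product.name)
        if key in self._products:
            raise ValueError(f"already published: {key}")
        self._products[key] = product

    def find(self, domain, name):
        return self._products[(domain, name)]

    def by_domain(self, domain):
        return sorted(p.name for p in self._products.values() if p.domain == domain)
```

The design point the sketch tries to capture is that the data stays with the domain: the catalog holds only references, and each product checks its own published contract when read, rather than relying on a central warehouse team to do so.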

With this, I conclude this series based on “Cloud Data Warehousing—Volume I”.

It’s time I started writing the next book: “Cloud Data Warehousing—Volume II: Implementing Data Warehouse, Lakehouse, Mesh, and Fabric.” As you can guess, the majority of the book will be devoted to diving deeper into the three emergent ADPs. I’m hoping to publish it early in the new year.

Matthias Mohler

MBA | Head of Data & AI Consulting at Swisscom | University Lecturer | Consulting Leader & Senior Advisor

1 year ago

Well said. I also like the term “marchitecture” for the mentioned types of diagrams that we see each and every day.


More articles by Barry Devlin
