The Future is Open

I write weekly on different topics related to Data and AI. Feel free to subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn or FawadQureshi on X.


I was working with a large public sector client who told me that across their entire data center estate they hold nearly 2.5 exabytes of data, yet the unique data is only about 300 petabytes, roughly an eightfold redundancy. The rest consists of copies of the same data, transformed into different formats for different systems, because each system prefers its own approach and formats. This gets unmanageable and expensive very fast. In this post I want to discuss a significant shift happening in the industry today and why the future data ecosystem will look very different.

Let's go back to school.

I want to talk about the Structured Computer Organization book by Andrew Tanenbaum. It taught us that, even after decades of innovation in computing, the basic building blocks of all computing infrastructure still follow the Von Neumann architecture.

The Von Neumann Architecture

In a Von Neumann Architecture, the CPU provides computing power, memory stores transient data, the disk is used for persistent data, and the bus transfers data between the different units. Storage and the storage path are often the bottlenecks in this architecture.

The Parallel Computing Spectrum

To overcome the limitations of storage and bus speeds, different vendors came up with different multi-processor approaches.

The Parallel Computing Spectrum

Early versions of SQL Server, Oracle Database, and IBM DB2 followed a shared-disk and shared-memory approach to building a parallel system.

Later on, Teradata, Oracle Exadata, and other vendors built multi-computer systems with a shared-nothing approach. However, in these models you had to maintain a fixed ratio of compute to storage: even if you only needed more compute, you could not add it without adding storage in the same proportion. You had to maintain that "balance" for the shared-nothing MPP model to operate properly, and it is this balance that allows the system to deliver linear scalability for embarrassingly parallel tasks.

Hadoop introduced the concept of schema-on-read, which was a separation of storage and compute at the software level: you can postpone structuring and processing the data until run time, and you can use any tool or application to connect to the data.
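To make schema-on-read concrete, here is a minimal sketch in PySpark; the file path, schema, and column names are illustrative assumptions, not from the article. The raw JSON files stay untouched on storage, and the structure is applied only at read time by whichever engine happens to be querying them.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

    # The reader declares the schema at query time; the raw files on
    # HDFS/object storage are never rewritten or pre-loaded anywhere.
    events_schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("event_ts", LongType()),
    ])

    # Hypothetical path; any other tool could read the very same files.
    events = spark.read.schema(events_schema).json("hdfs:///raw/events/")
    events.groupBy("user_id").count().show()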

Cloud databases then introduced separation of storage and compute at the infrastructure level, as Adam Storm discusses in his blog post. This means you can scale compute and storage independently of each other and build far more flexible systems.
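As an illustration of what infrastructure-level separation enables, a cloud warehouse can be resized without moving a single byte of data. A rough sketch using the Snowflake Python connector; the account, credentials, and warehouse name are placeholders.

    import snowflake.connector

    # Connection details below are placeholders.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="***",
    )

    # Resizing touches only the compute (warehouse) layer; the data sitting
    # in the storage layer is untouched and billed independently.
    conn.cursor().execute(
        "ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE'"
    )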

What is the big deal with Open Table Formats?

When cloud-native databases like Snowflake were created, they used the cloud's native storage and compute layers. But even though the data was stored in cloud object stores, there was no way to access it without going through the database's compute layer. If you wanted to query an open format such as Parquet or Avro, you first had to load the data into the storage layer of the cloud DBMS.
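To show what that "load first" step looked like, here is a hedged sketch of copying staged Parquet files into a Snowflake-managed table before they can be queried; the stage, table, and path names are hypothetical.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="***"
    )
    cur = conn.cursor()

    # In this model, open-format files had to be copied into the database's
    # own storage layer before the engine could see them.
    cur.execute("""
        COPY INTO sales_raw
        FROM @my_parquet_stage/sales/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)

    # Only after the load can the data be queried.
    cur.execute("SELECT COUNT(*) FROM sales_raw")
    print(cur.fetchone())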

The evolution of cloud database storage models

Then, about five years ago, most databases started adding support for reading open formats without loading them into local storage. There were still functional and non-functional differences in terms of performance, security, and scalability, but at least you could combine data stored inside the database with data available in an open format in a public location.
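Reading an open format in place removes the load step entirely. A minimal sketch with DuckDB querying Parquet files directly on object storage; the bucket and path are assumptions for illustration.

    import duckdb

    con = duckdb.connect()
    # The httpfs extension lets DuckDB read from S3-compatible object storage.
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # The Parquet files stay where they are; nothing is copied into a
    # proprietary storage layer. Bucket and path are placeholders.
    result = con.execute(
        "SELECT user_id, COUNT(*) AS events "
        "FROM read_parquet('s3://my-bucket/raw/events/*.parquet') "
        "GROUP BY user_id"
    ).fetch_df()
    print(result.head())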

What we now see is formats such as Apache Iceberg being adopted by the community to standardize across different compute engines. This allows organizations to keep only a single copy of the data and lets multiple compute engines connect flexibly to the same open table format in a read/write fashion. The onus is now on the vendors to ensure that they provide the same functional and non-functional capabilities on the open format as they do on their own native storage formats.
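As a sketch of what a single shared copy looks like in practice, PyIceberg can open the same Iceberg table that Spark, Trino, or a cloud warehouse writes to, by going through a shared catalog; the catalog name and table identifier below are assumptions for illustration.

    from pyiceberg.catalog import load_catalog

    # The catalog (REST, Glue, etc.) is the shared entry point every engine
    # uses to find the one copy of the table's metadata and data files.
    catalog = load_catalog("prod")  # catalog name/config are placeholders

    table = catalog.load_table("analytics.orders")

    # Any engine connected to the same catalog sees the same snapshots and
    # the same underlying Parquet data files.
    arrow_table = table.scan().to_arrow()
    print(arrow_table.num_rows)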

Why will databases stop charging for storage in the future?

Tomasz Tunguz argued that in the not-so-distant future, storage will no longer be part of the database engine. All data engines will connect to the same storage layer, providing the most efficient data sharing across the enterprise, with no need to create separate copies of the data in the proprietary formats of different systems.

Credit: Tomasz Tunguz

Summary

This move towards an open architecture would enable enterprises to maintain a single copy of data, shared across systems and departments, slashing costs and simplifying data management. You can build Data Supply Chains both within and outside the organization using this approach. This evolution marks a major step toward reducing inefficiencies in how data is stored and processed, especially as companies aim to streamline operations in an increasingly digital world.

The future is open; are you open to it?


I write weekly on different topics related to Data and AI. Feel free to subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn or FawadQureshi on X.


Paul Mracek

Enterprise Account Executive @ Databricks

5 months ago

Nice article and I always enjoy your FAQs. Keep writing! This one aligns with the vision we see at Databricks too. A critical missing point from your article (and Tomasz's diagram) is that all these engines must discover which tables exist and who is allowed to access them. You only want to define that in one place, not per engine, hence the importance of the [open] catalog. Both our companies are working on this piece too. I find it fascinating how the functions of a DBMS have been deconstructed while maintaining great performance. While we were at Teradata together, I never would have imagined customers could get the performance both Snowflake and Databricks provide with decentralized dictionaries and data storage, yet here we are.

Marco Ullasci

Data Solutions Architect, Singapore PEP

5 months ago

What is consuming more of the planet's resources? The duplication of data, or the use of performance-inefficient storage formats? At the single-system level I have no doubts about the answer: good luck winning a race using Iceberg (or Delta tables) against an optimized Oracle Exadata. At the global level, maybe the green (in)efficiency of the two approaches is on par, but so far I've not seen any numbers to support this.
