The Future is Open

I write weekly on different topics related to Data and AI. Feel free to subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn or FawadQureshi on X.


I was working with a large public sector client who told me that across their entire data center estate they hold nearly 2.5 exabytes of data, yet the unique data is only about 300 petabytes, roughly an eightfold redundancy. The rest consists of copies of the same data, transformed into different formats for different systems, because each system prefers its own approach and formats. This gets unmanageable and expensive very fast. In this post I want to discuss a significant shift happening in the industry today and why the future data ecosystem will look very different.

Let's go back to school.

I want to talk about the Structured Computer Organization book by Andrew Tanenbaum. It taught us that, even after decades of innovation in computing, the basic building blocks of all computing infrastructure still follow the Von Neumann architecture.

The Von Neumann Architecture

In a Von Neumann Architecture, the CPU provides computing power, memory stores transient data, the disk is used for persistent data, and the bus transfers data between the different units. Storage and the storage path are often the bottlenecks in this architecture.

The Parallel Computing Spectrum

To overcome the limitations of storage and bus speeds, different vendors came up with different multi-processor approaches.

The Parallel Computing Spectrum

Early versions of SQL Server, Oracle Database, and IBM DB2 followed a shared-disk and shared-memory approach to building a parallel system.

Later on, Teradata, Oracle Exadata, and other vendors built multi-computer systems with a shared-nothing approach. However, in these models you had to maintain a fixed ratio of compute to storage: even if you only needed more compute, you could not add it without adding storage in the same proportion. You had to maintain that "balance" for the shared-nothing MPP model to operate properly, and it is this balance that allows the system to deliver linear scalability for embarrassingly parallel tasks.

Hadoop introduced the concept of schema-on-read, which was a separation of storage and compute at the software level: you can postpone structuring and processing the data until run time, and you can use any tool or application to connect to the data.
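To make schema-on-read concrete, here is a minimal sketch in PySpark; the file path, schema, and column names are illustrative assumptions, not from the article. The raw JSON files stay untouched on storage, and the structure is applied only at read time by whichever engine happens to be querying them.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

    # The reader declares the schema at query time; the raw files on
    # HDFS/object storage are never rewritten or pre-loaded anywhere.
    events_schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("event_ts", LongType()),
    ])

    # Hypothetical path; any other tool could read the very same files.
    events = spark.read.schema(events_schema).json("hdfs:///raw/events/")
    events.groupBy("user_id").count().show()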

Cloud databases then introduced separation of storage and compute at the infrastructure level, as Adam Storm discusses in his blog post. This means you can scale compute and storage independently of each other and build far more flexible systems.
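As an illustration of what infrastructure-level separation enables, a cloud warehouse can be resized without moving a single byte of data. A rough sketch using the Snowflake Python connector; the account, credentials, and warehouse name are placeholders.

    import snowflake.connector

    # Connection details below are placeholders.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="***",
    )

    # Resizing touches only the compute (warehouse) layer; the data sitting
    # in the storage layer is untouched and billed independently.
    conn.cursor().execute(
        "ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE'"
    )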

What is the big deal with Open Table Formats?

When cloud-native databases like Snowflake were created, they used the cloud's native storage and compute layers. But even though the data was stored in cloud object stores, there was no way to access it without going through the database's compute layer. If you wanted to query an open format such as Parquet or Avro, you first had to load the data into the storage layer of the cloud DBMS.
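To show what that "load first" step looked like, here is a hedged sketch of copying staged Parquet files into a Snowflake-managed table before they can be queried; the stage, table, and path names are hypothetical.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="***"
    )
    cur = conn.cursor()

    # In this model, open-format files had to be copied into the database's
    # own storage layer before the engine could see them.
    cur.execute("""
        COPY INTO sales_raw
        FROM @my_parquet_stage/sales/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)

    # Only after the load can the data be queried.
    cur.execute("SELECT COUNT(*) FROM sales_raw")
    print(cur.fetchone())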

The evolution of cloud database storage models

Then, about five years ago, most databases started adding support for reading open formats without loading them into local storage. There were still functional and non-functional differences in terms of performance, security, and scalability, but at least you could combine data stored inside the database with data available in an open format in a public location.
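Reading an open format in place removes the load step entirely. A minimal sketch with DuckDB querying Parquet files directly on object storage; the bucket and path are assumptions for illustration.

    import duckdb

    con = duckdb.connect()
    # The httpfs extension lets DuckDB read from S3-compatible object storage.
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # The Parquet files stay where they are; nothing is copied into a
    # proprietary storage layer. Bucket and path are placeholders.
    result = con.execute(
        "SELECT user_id, COUNT(*) AS events "
        "FROM read_parquet('s3://my-bucket/raw/events/*.parquet') "
        "GROUP BY user_id"
    ).fetch_df()
    print(result.head())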

What we now see is formats such as Apache Iceberg being adopted by the community to standardize across different compute engines. This allows organizations to keep only a single copy of the data and lets multiple compute engines connect flexibly to the same open table format in a read/write fashion. The onus is now on the vendors to ensure that they provide the same functional and non-functional capabilities on the open format as they do on their own native storage formats.
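As a sketch of what a single shared copy looks like in practice, PyIceberg can open the same Iceberg table that Spark, Trino, or a cloud warehouse writes to, by going through a shared catalog; the catalog name and table identifier below are assumptions for illustration.

    from pyiceberg.catalog import load_catalog

    # The catalog (REST, Glue, etc.) is the shared entry point every engine
    # uses to find the one copy of the table's metadata and data files.
    catalog = load_catalog("prod")  # catalog name/config are placeholders

    table = catalog.load_table("analytics.orders")

    # Any engine connected to the same catalog sees the same snapshots and
    # the same underlying Parquet data files.
    arrow_table = table.scan().to_arrow()
    print(arrow_table.num_rows)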

Why will databases stop charging for storage in the future?

Tomasz Tunguz argued that in the not-so-distant future, storage will no longer be part of the database engine. All data engines will connect to the same storage layer, providing the most efficient data sharing across the enterprise, with no need to create separate copies of the data in the proprietary formats of different systems.

Credit: Tomasz Tunguz

Summary

This move towards an open architecture would enable enterprises to maintain a single copy of data, shared across systems and departments, slashing costs and simplifying data management. You can build Data Supply Chains both within and outside the organization using this approach. This evolution marks a major step toward reducing inefficiencies in how data is stored and processed, especially as companies aim to streamline operations in an increasingly digital world.

The future is open; are you open to it?


I write weekly on different topics related to Data and AI. Feel free to subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn or FawadQureshi on X.


Paul Mracek

Enterprise Account Executive @ Databricks

5 months ago

Nice article and I always enjoy your FAQs. Keep writing! This one aligns with the vision we see at Databricks too. A critical missing point from your article (and Tomasz's diagram) is that all these engines must discover which tables exist and who is allowed to access them. You only want to define that in one place, not per engine, hence the importance of the [open] catalog. Both our companies are working on this piece too. I find it fascinating how the functions of a DBMS have been deconstructed while maintaining great performance. While we were at Teradata together, I never would have imagined customers could get the performance both Snowflake and Databricks provide with decentralized dictionaries and data storage, yet here we are.

Marco Ullasci

Data Solutions Architect, Singapore PEP

5 months ago

What is consuming more of the planet's resources? The duplication of data, or the use of performance-inefficient storage formats? At the single-system level I have no doubts about the answer: good luck winning a race using Iceberg (or Delta tables) against an optimized Oracle Exadata. At the global level, maybe the green (in)efficiency of the two approaches is on par, but so far I've not seen any numbers to support this.
