Why Databases Won't Charge for Storage in the Future
The database is being unbundled. Historically, a database like Snowflake sold both data storage & a query engine (& the computing power to execute the query).
But customers are pushing for a deeper separation of compute & storage. In Snowflake's own words:
"A lot of big customers want to have open file formats to give them the options…So data interoperability is very much a thing and our AI products can generally act on data that is sitting in cloud storage as well."
"We do expect a number of our large customers are going to adopt Iceberg formats and move their data out of Snowflake, where we lose that storage revenue and also the compute revenue associated with moving that data into Snowflake."
Instead of locking the data in one database, customers prefer to have it in open formats like Apache Arrow, Apache Parquet, Apache Iceberg.
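To make "one copy in an open format" concrete, here is a minimal sketch, assuming the pyarrow Python package and a toy events table (both illustrative, not from the post): the dataset lives as a Parquet file that any engine can read, rather than as rows locked inside a proprietary store.

```python
# Minimal sketch: write a dataset once as Parquet, an open columnar format.
# The table contents and file name are hypothetical examples.
import pyarrow as pa
import pyarrow.parquet as pq

events = pa.table({
    "user_id": [1, 2, 3],
    "event": ["signup", "query", "query"],
})

# The Parquet file on disk (or in object storage) becomes the system of
# record; no single database owns it.
pq.write_table(events, "events.parquet")
```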
As data use inside of an enterprise has expanded, so has the diversity of demands on that data.
Rather than copying the data each time for a different purpose, whether it's exploratory analytics, BI, or AI pipelines, teams keep a single copy in an open format that different query engines can read.
This saves money: storage is about $280-300M overall for Snowflake. In the company's words:
"As a reminder, about 10% to 11% of our overall revenue is associated with storage."
But it also simplifies architectures.
It also ushers in an epoch where query engines will compete for different workloads on price & performance: Snowflake may be better for large-scale BI; Databricks’ Spark for AI data pipelines; MotherDuck for interactive analytics.
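A sketch of what that competition could look like in practice, assuming DuckDB and pyarrow are installed and reusing the hypothetical events.parquet file from the snippet above: two different engines read the same copy, and the customer pays only for the compute each one consumes.

```python
# Two engines, one copy of the data: DuckDB for interactive SQL, Arrow for
# a Python pipeline. Neither engine stores its own copy.
import duckdb
import pyarrow.parquet as pq

# Engine 1: DuckDB runs SQL directly against the Parquet file.
counts = duckdb.sql(
    "SELECT event, COUNT(*) AS n FROM 'events.parquet' GROUP BY event"
).fetchall()

# Engine 2: an Arrow-based pipeline reads the identical bytes.
table = pq.read_table("events.parquet")

print(counts, table.num_rows)
```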
Data warehouse vendors have marketed the separation of storage & compute in the past. But that message was about scaling the system to handle bigger data within their own products.
Customers demand a deeper separation - a world in which databases don’t charge for storage.
Data Solutions Architect, Singapore PEP · 5 months ago
The market is right even when it's (environmentally) wrong: no one can win against it. In the process, R.I.P. computational efficiency, sacrificed on the altar of open formats. Additional caching mechanisms will come to the rescue once the tradeoffs of the open formats become unacceptable, just as happened with many data federation solutions in the past.
Experienced Data Leader | Builder | Father · 11 months ago
This has been one of the intentions of the big data revolution, so why are we saying that we're inching towards such an era in cloud data warehousing? Presto, Trino… even Redshift to some extent have had independently scalable compute and external storage layers. I guess I'm just confused? Maybe this is specific to Snowflake's strategy and not their technology. I worked on a Snowflake implementation in 2019-2020 that I didn't design. The design called for Snowflake-managed storage, but I made extensive use of external schemas pointed at S3 and saved storage costs. Using Athena right now is extremely cost effective, and its compute is completely decoupled from its storage layer.
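For readers less familiar with the pattern this commenter describes, here is a hedged sketch of compute decoupled from storage using Athena's API via boto3; the bucket, database, and column names are hypothetical placeholders, not values from the comment.

```python
# Sketch: register Parquet files in S3 as an external table and query them
# with Athena. Storage stays in S3; Athena only rents the compute.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
    user_id BIGINT,
    event   STRING
)
STORED AS PARQUET
LOCATION 's3://example-data-lake/events/'
"""

# The same files remain readable by Spark, Trino, DuckDB, or a Snowflake
# external table, because no engine holds the only copy.
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)
```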
Tomasz Tunguz, a question: is the main reason customers are calling for the decoupling the cost of storage, interoperability for different use cases (as in 3), or making the current combination of storage and compute more efficient (as in 2)? The reason for the question is that Databricks and Snowflake have indeed made a lot of changes and have been shifting their architectures towards these open formats. As you mention, this is to handle more data more efficiently in their own platforms. But this might be its own reward for customers, since more efficiency brings down cost in current operations (where storage is not the main cost). So can you tell us a bit more about the motivations you see?
This assumes that all data must be moved to "storage" first. What about keeping data in the source system and querying it when needed? Right now we are already making one copy of all the data we think we need. Storage is cheaper now and fast enough for most uses; streaming data might be the exception. But there is still duplication, and it can create inconsistent data (between storage and source). An interesting future would be one where data (mostly) remains in source systems. The data with the most demanding speed requirements is moved to storage; the rest remains. Everything gets a metadata tag, queries run against the metadata, and the data is retrieved from either location. What would be needed for this? Faster transfer? A different type of indexing and tagging of source-system data? Better search? Different source systems? Snowflake's invention was to hash everything in a different way, which meant they could distribute data differently while keeping query and retrieval time the same as before (or faster). Could something similar be done with a hybrid storage approach?
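A speculative sketch of the hybrid approach this comment imagines, with all names invented for illustration: a metadata catalog tags each dataset, and a query is routed either to a fast copy in analytical storage or back to the source system.

```python
# Hypothetical metadata-driven routing: hot datasets get a storage copy,
# everything else is fetched from the source system on demand.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetTag:
    name: str
    latency_sensitive: bool          # hot data earns a copy in analytical storage
    storage_location: Optional[str]  # e.g. a Parquet path, if a copy exists
    source_location: str             # the operational system of record

CATALOG = {
    "orders": DatasetTag("orders", True, "s3://lake/orders/", "postgres://erp/orders"),
    "audit_log": DatasetTag("audit_log", False, None, "postgres://erp/audit_log"),
}

def resolve(dataset: str) -> str:
    """Route a query to the storage copy when one exists, else to the source."""
    tag = CATALOG[dataset]
    if tag.latency_sensitive and tag.storage_location:
        return tag.storage_location
    return tag.source_location

print(resolve("orders"))     # -> s3://lake/orders/
print(resolve("audit_log"))  # -> postgres://erp/audit_log
```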
Great article Tomasz Tunguz! This is the biggest trend in the data space right now and is going to have a huge impact. It solves one of the biggest challenges that enterprises face: data lock-in. Freeing your core data from the clutches of vendors enables better value generation from it, via a variety of tools that can process that single copy of data. As an analytics vendor, open table formats allow us to employ specialized compute engines tailored for specialized workloads such as event data, time series, and graphs, instead of being forced to work with a lowest-common-denominator SQL engine; and we can do that without having to make copies of the data into proprietary stores. Further, it lets us monetize better by being able to own and charge for compute too. Game changing!