The "Big Data" hype is over. Aftermath.
Google Trends for "big data" in the United States.

Web searches for "big data" have been steadily declining, and the trend appears irreversible. The "Big Data" hype started around 2010, and today we can confidently say that it's effectively over. What are we left with as a result?

All technology hypes eventually fizzle, but most of them leave something useful in the aftermath. That includes not just technologies and products, but also lessons learned. As I see it, the "Big Data" hype resulted in one major technological advance and one lesson:


Serverless and fully managed data stores

The hype gave us "serverless" data lakes and data warehouses, and they will likely stay relevant for decades ahead. Apache Spark (and Databricks), Amazon S3, Snowflake, Google BigQuery, and similar products have gained wide popularity.

Another big shift was the separation of storage and compute: storage can be serverless, while compute is fully managed and sometimes even provided by a different vendor.

It remains to be seen whether there will be open-source analogs of BigQuery or Snowflake with Postgres-level maturity, suitable for self-hosting on a mixed pool of virtual and bare-metal machines. Or maybe one already exists, and I'm just not aware of it.


What about leveraging unstructured data?

Querying and extracting knowledge from unstructured data was promoted as one of the main benefits of "Big Data", but it failed to live up to the promise, at least at scale.

For now, SQL remains the main tool for querying "big data".


Medium data

One of the surprising lessons from all the fuss about "big data" is that it ... almost never exists. There are two explanations for this:

First, while the hype lasted, hardware specs kept following Moore's law and jumped a few orders of magnitude. That has shifted what counts as "big data". No, 10TB is no longer the "big data" it once was. Nowadays, you can trivially procure a Windows machine with 1TB of RAM and tens of terabytes of local disk storage. For 99% of analytical workloads, that's effectively a "cloud warehouse" and a "data lake" in one box. Put a columnar database on it, and its performance will suffice for 99% of organizations out there. Its biggest downside: it won't look cool on a resume.

Second, as it turns out, "big data" is almost never needed in practice. Jordan Tigani, in his excellent article "Big Data is Dead," makes many great points. For instance, he argues that the median data volume of a real-life data warehouse lies in the 100GB range. That was big back when MS Access was popular, but today you can load the whole thing into memory on a machine that costs less than $100/month. Of course, you can use "Big Data" technology with all its complexity even for a 100GB dataset, but would that be reasonable?
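The arithmetic behind that claim is simple enough to check on a napkin. A sketch, using the 100GB median figure from the article and a hypothetical 1TB-RAM machine like the one mentioned above:

```python
# Back-of-envelope check: a median warehouse vs. one big machine's RAM.
# The 100 GB figure is cited above; the 1 TB machine is a hypothetical spec.
dataset_gb = 100    # median warehouse size per "Big Data is Dead"
ram_gb = 1024       # a single machine with 1 TB of RAM

fraction_used = dataset_gb / ram_gb
print(f"The whole warehouse occupies {fraction_used:.0%} of RAM")
```

In other words, the entire median warehouse fits in memory roughly ten times over, with no cluster involved.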

So what we've learned from the "Big Data" hype is that most of us who work in data engineering actually work with medium data. An unexpected but, in retrospect, logical finding.

There never was a "Medium Data" hype, probably because it rests on a much cheaper (and thus less lucrative) and simpler technology stack. It has fewer barriers, requires much less ceremony, and, importantly, is approachable and usable for less technical business professionals.

Meanwhile, a new hype is in full swing. I guess you know what I'm talking about :) It will be interesting to see what remains in the aftermath of that one.

Time will tell.

PS. Looking for a great data automation platform for medium data? Check out EasyMorph.
