A (Recent) History of Batch Data

For more similar articles, follow along at cloudconstructed.substack.com.

One of the most interesting shifts in technology began nearly twenty years ago with Google’s release of MapReduce, a programming model and reference implementation for distributed processing of large datasets on clustered computers.

Throughout the 20th century, most cutting-edge technologies coming out of the US were either invented or made viable in government research labs. The airplane was of course invented by the Wright brothers, but shortly thereafter the US government purchased a plane from them and designed repeatable manufacturing processes, making the technology much more accessible. A similar pattern exists with the transistor, radar, the internet, and other pivotal innovations of the last hundred years.

These technologies were operationalized by or with help from the government and have since been applied in private industries and made broadly available to end users. Starting about twenty years ago, leading internet companies flipped that script in one respect by developing technologies in-house that are now being used across government agencies. These technologies kicked off the notorious “big data” movement.

In 2004, Jeff Dean and Sanjay Ghemawat of Google released a paper titled MapReduce: Simplified Data Processing on Large Clusters, which described a programming model for distributed processing of large datasets on commodity hardware. The datasets that powered Google Search were growing so quickly that conventional processing methods could not keep up, and Google needed a new approach. In the eighteen years since that release, the innovation behind MapReduce has led to a Cambrian explosion of data processing tools and has brought the (now over-used) term “big data” into modern corporate vernacular.
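To make the programming model concrete, here is a toy word count in Python, the canonical MapReduce example. The map function emits (key, value) pairs, the reduce function aggregates all values that share a key, and in a real cluster the framework shuffles data between those two phases across machines. This is a single-process sketch of the paradigm, not Google’s implementation.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: aggregate all partial counts for a single key.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle: group every emitted value by key, standing in for the
    # network shuffle a real framework performs between map and reduce.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(word, counts) for word, counts in groups.items()]

print(run_mapreduce(["the quick brown fox", "the lazy dog and the fox"]))
```

The appeal of the model is that a developer only writes the two pure functions; partitioning, scheduling, and fault tolerance are the framework’s job.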

I decided to map out a few of the key technologies from this movement and place them on a timeline to give more context on where each innovation fits in the landscape.

[Image: a timeline of key batch data technologies, 2004 onward]

To briefly recap, MapReduce (and the original Google File System) catalyzed the movement in 2004. Two years later, Hadoop and HDFS (Hadoop Distributed File System) were released by teams at Yahoo as an implementation of the MapReduce programming paradigm. Around this same time, Amazon debuted their public cloud with the release of S3 (Simple Storage Service), pioneering cloud object storage with nearly unlimited scalability and eventual consistency. By 2010, these distributed file systems had become so popular as an easy way to store data across many homogeneous nodes that a new class of query engines emerged (Dremel, Hive, Impala, Drill, etc.) to help analytical users query these distributed datasets and quickly derive value.
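As a sketch of what this query-engine layer makes possible, here is how an analyst might query raw Parquet files sitting in object storage as if they were a database table. I use Spark SQL as a stand-in for this class of engines; the bucket, path, and schema are hypothetical, and the s3a:// scheme assumes the Hadoop S3 connector is configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Point the engine directly at files in object storage; there is no
# load step and no database server owning the data.
events = spark.read.parquet("s3a://example-bucket/events/")  # hypothetical bucket
events.createOrReplaceTempView("events")

# Analysts query the distributed dataset with ordinary SQL.
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily.show()
```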

As the data lake ecosystem became an enterprise standard, practitioners wanted to interact with data lakes the way they had always interacted with databases: with transactions, change streams, and record-level updates. Out of this need emerged projects now gaining steam, like Hudi, Iceberg, and Delta Lake. Data formats have also begun to evolve from the traditional row-based orientation to columnar formats like Parquet and Arrow, since large-scale analytical workloads perform better against columnar storage.
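For a flavor of what these table formats add, here is a minimal PySpark sketch using the open-source Delta Lake API (the delta-spark package): an atomic overwrite followed by a record-level update, an operation bare Parquet files in a lake do not support. The path and columns are hypothetical, and Hudi and Iceberg expose comparable operations under their own APIs.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Standard configuration to enable Delta Lake in a Spark session.
spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# An ACID write: readers never observe a partially written table.
df = spark.createDataFrame([(1, "open"), (2, "open")], ["order_id", "status"])
df.write.format("delta").mode("overwrite").save("/tmp/orders")  # hypothetical path

# A record-level update, committed as a new transaction on the table.
orders = DeltaTable.forPath(spark, "/tmp/orders")
orders.update(condition="order_id = 1", set={"status": "'shipped'"})
```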

I’d like to keep my analysis simple with three insights that I gather from viewing these technologies on a twenty-year timeline.

The Data Lakehouse architecture is the result of two decades of modular components being developed at increasing levels of abstraction. Hadoop and early file systems proved that petabytes of data could be stored and transformed on lots of cheap hardware. With this new ability, thousands of companies started storing data in Hadoop clusters, to the point that analysts expected to write SQL to query the data the way they were used to doing in data warehouses, leading to the query engine layer. And finally, at the highest levels of abstraction, table and data formats have been defined to standardize storage and compute and to make the “Lakehouse” a transactional system. Looking ahead, I’d expect this trend to continue and for more companies building around this architecture to emerge.

The rate of innovation in a category is driven more by low switching costs than by room for improvement. That is to say, the fact that Hadoop is still broadly used sixteen years after its release does not mean Hadoop is without its challenges; it shows how heavy the lift is to migrate off systems that are both stateful and at scale. Similarly, Spark has a few drawbacks, including no native support for real-time processing, an expensive cost structure, and problems handling many small files. However, because one’s choice of batch processing framework is a multi-year decision, this layer of the stack has a much longer lifecycle than, for example, query engines, which are more plug-and-play.

Columnar data formats continue to spread closer to the end data consumer. Columnar formats (where data is stored contiguously column by column) are generally more efficient for analytical use cases, whereas row-based formats (where each row is stored contiguously) better serve transactional use cases and certain types of search queries. The amount of data out of which organizations try to mine value has exploded (not coincidentally) along with the timeline shown in the chart above. This increased volume makes data compression all the more important, and columnar formats compress more easily than row-based formats. The result is that these columnar formats have moved from the file level (Parquet), to in-memory (Arrow), and even more recently to network protocols and client SDKs (Arrow Flight).
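A small PyArrow sketch illustrates the point: once data lands in a columnar Parquet file, a reader can fetch a single column without scanning the rest of the file, and each column compresses well because similar values are stored together. The file name and schema here are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table (columnar layout in RAM).
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "JP"],
    "revenue": [120.0, 80.5, 33.0, 240.0],
})

# Persist it as Parquet (columnar layout on disk, compressed per column).
pq.write_table(table, "metrics.parquet", compression="snappy")

# An analytical read touches only the bytes of the requested column.
revenue_only = pq.read_table("metrics.parquet", columns=["revenue"])
print(revenue_only)
```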

I look forward to watching this space over the next few years! If you spend time in any of these areas or have any feedback on what I’ve put together, I’d love to hear from you at [email protected].

Dean Meyer

Partner at Sequoia Capital

Once again, awesome

Delta Lake wasn't technically open sourced until early 2019, but it had already been available for quite some time as part of the Databricks product and is the most mature of the listed table formats. Also, keep in mind that while "Hive" has historically been referenced as a query engine, Hive has multiple components: the other products listed under query engines here depend on Hive-compatible metastores, even as newer engines such as Spark SQL (not listed) have taken the place of Hive's own query execution.

Alexander Filipchik

Software, Data, and ML infrastructure.

I think A LOT of icons are missing. The stack is very saturated for sure.

Raj K.

Head of Databricks Capability @ Apex Systems | Databricks Solutions Architect Champion

Where is Flink? If Hadoop and MR are the competition, Spark is in a sweet spot :)

Sarwar Bhuiyan

Technical Architect | Software Engineering, Data and Stream processing, Cloud architectures | Product Management | Technology Consulting | Field Advisory

Where’s Apache Kafka?
