Hadoop is declining: what are the alternatives?
Mehdi Charafeddine
IBM Distinguished Engineer | Posts about Data, Data Governance, AI, and Engineering
Hadoop is undoubtedly experiencing a hard fall right now. As with any technology, it has gone through multiple phases, from innovation through enlightenment to a plateau of productivity.
When Hadoop came out in the mid-2000s, it promised massively parallel, horizontally scalable data processing. In addition, an entire ecosystem of tools, such as Hive, HBase, and Spark, was developed by the open-source community to plug into and build on top of Hadoop. What ensued was widespread, global adoption of Hadoop-based data platform initiatives by companies looking to tackle big datasets.
In this post, we will explore the reasons behind Hadoop's decline and what alternatives are available that are starting to change the landscape.
Hadoop is more than one tool
Hadoop consists of a rich ecosystem of tools that aim to solve multiple challenges of big data. The most notable components are:
- Hadoop Core: a set of common libraries and utilities used and shared by the other Hadoop modules.
- HDFS: a distributed file system that stores raw data, without any structure defined up front, on commodity block storage.
- YARN (Yet Another Resource Negotiator): a resource management platform that schedules applications and allocates compute resources across the Hadoop nodes.
- MapReduce: a distributed computing engine that processes data in parallel across multiple physical machines and then brings together ("reduces") the partial results from those machines into one result set (a minimal sketch follows this list).
- Hive: a distributed, fault-tolerant data warehouse system that enables analytics at large scale.
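To make the map/reduce model concrete, here is a minimal, single-process Python sketch of the classic word-count job. It is purely illustrative: a real MapReduce job runs the map and reduce phases on many machines and shuffles the intermediate pairs between them.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# On a real cluster, each "split" would live on a different data node.
splits = ["the quick brown fox", "the lazy dog", "the fox jumps"]
all_pairs = (pair for doc in splits for pair in map_phase(doc))
print(reduce_phase(all_pairs))  # {'the': 3, 'fox': 2, 'quick': 1, ...}
```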
Hadoop is typically deployed to implement two main use cases:
- Data Lake: a repository of structured and unstructured data stored in its raw format.
- Data Warehouse: a repository of structured and refined data that caters to specific use cases to provide meaningful business insights.
How about Spark?
Spark, whose 1.0 release came out in 2014, addressed the performance limitations of MapReduce, whose two-stage design saves intermediate results to disk between computations. Spark introduced several optimizations, the most important being its ability to keep intermediate data in memory across the cluster.
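As a small illustration of that in-memory point, the hypothetical PySpark snippet below caches a DataFrame and reuses it across two aggregations; the input path and column names are placeholders. With MapReduce, each pass over the data would typically re-read it from disk.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path; any columnar dataset with "country" and "amount" works.
orders = spark.read.parquet("s3a://example-bucket/orders/")

orders.cache()  # keep the dataset in executor memory for subsequent actions

totals_by_country = orders.groupBy("country").agg(F.sum("amount").alias("total"))
order_counts = orders.groupBy("country").count()

totals_by_country.show()
order_counts.show()
```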
Until recently, Spark's fate was tightly linked to Hadoop's, because it relied on the YARN scheduler to orchestrate the various steps of job execution.
Even though Spark was designed to run on multiple cluster managers, it was historically used primarily with YARN and was embedded in most Hadoop distributions. Over the years there have been multiple major iterations of Spark. Today, with the rise of Kubernetes as the most popular scheduler, Spark has become a first-class citizen of Kubernetes and has recently removed its dependency on Hadoop.
Why is Hadoop declining?
It is now widely accepted that Hadoop is on the decline. The point of this article is not to debate that fact, but rather to examine what led to it and what alternatives are available in the market.
Key technological advancements
In the past 10 years, we have witnessed astonishing technical advances in the areas of compute and storage. For example, storage costs have dropped from 10¢/GB to 2¢/GB.
Data explosion
In an increasingly connected world, it is no surprise that enterprises are generating exponentially more data. Analytics and AI have become part of every company's core strategy and are core business differentiators. Against this backdrop, Hadoop's architecture shows several limitations.
Hadoop's limitations
- Coupled compute and storage: storage requirements tend to grow steadily and significantly faster than compute, while compute needs, although also trending upward, are spiky by nature and must be elastic to be cost-efficient. Because Hadoop couples compute and storage in the same nodes, it cannot accommodate compute elasticity, and this leads to higher costs.
- Operational cost: Hadoop systems are complex to install, operate, tune, and scale. Hadoop requires highly skilled professionals and is notoriously difficult to manage at scale. This problem gets amplified as the number of Hadoop clusters and their size grow.
- Namespace design: HDFS is built on a single-node namespace server architecture. Because the namenode is the single container of the file system metadata, it becomes a limiting factor for file system growth. To keep metadata operations fast, the namenode loads the entire namespace into memory, so the size of the namespace is limited by the amount of RAM available to the namenode. While a Hadoop cluster can have thousands of data nodes, the choking point is ultimately the namenode.
- Small files: besides the fact that retrieving and processing small files is very inefficient in Hadoop, there is a practical limit of roughly 500 million small files that can be managed in HDFS, again due to the namenode architecture (a back-of-the-envelope estimate follows this list).
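A commonly cited rule of thumb is that each namespace object (file, directory, or block) consumes on the order of 150 bytes of namenode heap. The estimate below uses that assumed figure; the exact numbers vary by version and workload, but they show why the namenode becomes the choking point.

```python
# Rough, assumed figures: ~150 bytes of namenode heap per namespace object,
# and each small file contributing one file entry plus one block entry.
BYTES_PER_OBJECT = 150
OBJECTS_PER_SMALL_FILE = 2

def namenode_heap_gb(num_files: int) -> float:
    """Estimate the namenode heap (in GB) needed to hold the namespace in memory."""
    return num_files * OBJECTS_PER_SMALL_FILE * BYTES_PER_OBJECT / 1e9

for files in (10_000_000, 100_000_000, 500_000_000):
    print(f"{files:>12,} files -> ~{namenode_heap_gb(files):.0f} GB of heap")

# 500 million small files already demand on the order of 150 GB of RAM
# on a single namenode, before any other overhead.
```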
What are the alternatives?
The alternatives to Hadoop look very different depending on whether you deploy to on-premises or cloud infrastructure. Since Hadoop bundles multiple components, we will consider storage, compute, and data warehouse capabilities separately.
Storage
Object storage technology emerged as the standard for storing data reliably while minimizing cost and maximizing utility. It is built on commodity hardware and offers a simple, uniform, virtually limitless namespace and atomic writes, all exposed over HTTP. It also uses software-based erasure coding for data protection, which costs significantly less than traditional replication-based protection.
Object storage was initially used for backup, archiving, and serving static content. However, it has now become a widely accepted alternative to HDFS.
Object storage implementations are available for both cloud and on-premises deployments. Examples include Amazon S3, Azure Blob Storage, and Google Cloud Storage.
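To show how little changes on the consumption side, the hypothetical PySpark snippet below reads directly from an object store through the s3a connector instead of from HDFS; the bucket and path are placeholders, and credential and connector setup will vary by environment.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-storage-read").getOrCreate()

# With HDFS the path would look like "hdfs://namenode:8020/data/events/";
# with object storage, the same code simply points at an s3a:// URI instead.
events = spark.read.parquet("s3a://example-data-lake/events/")  # hypothetical bucket

events.printSchema()
print(events.count())
```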
Object Storage offers the following benefits over HDFS:
- Cost: HDFS traditionally relies on replicating data across multiple nodes (typically three copies), so it cannot take advantage of the much lower overhead of erasure coding.
- Elasticity: cloud object stores such as S3 scale virtually without limit and are paid for as used, whereas an HDFS cluster must be sized and grown ahead of demand.
- SLA (availability and durability): Amazon claims 99.999999999% (eleven nines) durability and 99.99% availability for S3, whereas typical Hadoop deployments achieve about 99.9% availability.
However, since an object store is not a filesystem, there are some limitations to be aware of; several behaviors a filesystem user would expect are missing or weaker:
- Directory listings may be only eventually consistent (depending on the object store): after an API call that stores or modifies data, there can be a small window during which the data has been durably stored but is not yet visible to all read requests.
- File overwrites and deletes may likewise be only eventually consistent: readers can receive old data.
- There is no rename operation; the S3A client mimics it by listing a directory and copying each file underneath, one by one.
Note that object storage can also be used with Hadoop and Spark; when doing so, a suitable output committer, such as one of the S3A committers, should be used, as sketched below.
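As an illustration, the hypothetical PySpark configuration below selects one of the S3A committers. The committer choice ("directory", "partitioned", or "magic") and, depending on the Spark version, the need for the spark-hadoop-cloud bindings and extra commit-protocol settings vary, so treat this as a sketch rather than a drop-in recipe.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer-sketch")
    # Choose an S3A committer instead of the default rename-based file committer.
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    # Local disk staging area assumed for the directory committer.
    .config("spark.hadoop.fs.s3a.buffer.dir", "/tmp/s3a")
    # Depending on the Spark version, the spark-hadoop-cloud bindings and
    # additional commit-protocol settings may also be required.
    .getOrCreate()
)

# Hypothetical output location; the committer avoids relying on rename to publish results.
spark.range(1_000).write.mode("overwrite").parquet("s3a://example-data-lake/output/")
```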
Compute
For many years now, Spark has been the preferred compute engine on Hadoop, thanks to the performance gains of its in-memory processing. While Spark's earlier versions relied heavily on YARN, recent versions (especially the 3.x line) have a clear focus on running Spark on Kubernetes. This is a big architectural improvement that allows compute resources to scale elastically, independently of the storage resources.
With the recent Apache Spark 3.1 release in March 2021, the Spark on Kubernetes project is now officially declared as production-ready and Generally Available.
With this milestone, running your batch workloads as containerized Spark 3.1 applications is definitely a good way to evolve your Hadoop architecture towards more flexibility and reduce the load on your existing clusters.
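For illustration, the sketch below shows the kind of configuration involved when pointing Spark at a Kubernetes cluster. The API server address, namespace, and container image are placeholders, and in practice most teams submit such jobs with spark-submit in cluster mode rather than building the session in a local driver.

```python
from pyspark.sql import SparkSession

# Hypothetical values: replace the API server URL, namespace, and image
# with those of your own cluster and registry.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "registry.example.com/spark-py:3.1.1")
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Trivial job to confirm executors were scheduled on the cluster.
spark.range(10_000).selectExpr("sum(id)").show()
```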
While Spark can be deployed on Kubernetes quickly, running a reliable, production-grade deployment is challenging and requires significant expertise; hence, it is wise to consider turnkey Spark-on-Kubernetes offerings.
Note that running Spark on Kubernetes does not in itself offer performance gains; its performance has only recently caught up with Spark on YARN.
Data Warehouse
One of the most popular components used with Hadoop is the Hive data warehouse. Next-generation, cloud-native data warehouses offer significant improvements over Hive:
- SaaS model: although Hive can also be found as SaaS, most Hadoop deployments are on-premises. A SaaS model significantly reduces the cost of ownership and speeds up new deployments.
- Serverless: users do not need to deploy or manage any servers.
- Deployment model: cloud vs on-premises.
When to consider alternatives to Hadoop?
- High operational cost
- High data growth
- New data platform implementations
- New data use cases
- Architecture simplification
In many cases, it will not make sense to simply migrate away from Hadoop. Too many clients have invested too much for it to just go away, and the cost of migrating to and testing a new platform would have to be borne. But alternatives should be considered for new use cases and for specific components of a big data solution.
Conclusion
Various architecture strategies exist to tackle Hadoop’s limitations:
- For data lake use cases, replacing HDFS with object storage will lower the cost. But specific measures need to be taken to address the increased latency and the lack of filesystem features.
- For cloud migration or greenfield use cases, a more modern data warehouse combined with object storage provides the best solution.
- Cloud Hadoop distributions such as GCP Dataproc, AWS EMR, and Azure HDInsight can provide lower operational risk and a pay-per-use commercial model for lift-and-shift migration scenarios, but this approach brings little other benefit.
- For Spark workloads, the cost of migrating and refactoring existing jobs from older versions to Spark 3.x must be weighed before you can take advantage of a truly elastic compute architecture.
Regardless of the chosen approach, having a “beyond Hadoop” strategy for big data is important, because it will enable you to plan and prepare to address tomorrow’s data explosion challenges.