Hadoop versus Spark: Who’s winning?
A quick analysis of the hottest competition to hit the analytics world. Has Spark begun to overthrow Hadoop? Or will Hadoop hold the ground it has fortified over 10 years? Let’s find out.
Apache Spark
Apache Spark is a general-purpose Big Data analytics engine that runs on distributed computing clusters, including Hadoop clusters. It performs computations in memory, which speeds up data processing compared with MapReduce. It can run on top of an existing Hadoop cluster and read from Hadoop's data store (HDFS). It can process large quantities of data using both batch and streaming methods, a combination often associated with the lambda architecture. Its core programming interface is centered on a data structure called the resilient distributed dataset (RDD). An RDD is a read-only multiset of data items that is distributed over a cluster of machines and maintained in a fault-tolerant way.
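To make the RDD idea concrete, here is a minimal sketch in plain Python. It is a toy model, not the Spark API: the names `ToyRDD` and `parallelize` are hypothetical, and the "partitions" are ordinary tuples standing in for data held on separate machines. It only illustrates the two properties described above: the dataset is read-only, and its items are split across partitions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD: an immutable, partitioned multiset."""
    partitions: Tuple[Tuple[int, ...], ...]

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never mutates the dataset; it returns a new one.
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self) -> list:
        # Gather the items of every partition into a single list.
        return [x for part in self.partitions for x in part]


def parallelize(items, num_partitions: int) -> ToyRDD:
    # Round-robin split across hypothetical worker nodes.
    items = list(items)
    return ToyRDD(tuple(tuple(items[i::num_partitions]) for i in range(num_partitions)))


numbers = parallelize([1, 2, 2, 3, 4, 5], num_partitions=3)
doubled = numbers.map(lambda x: x * 2)  # `numbers` itself is left unchanged
```

Because the class is frozen and `map` builds a new object, the original dataset can always be reused or recomputed, which is what makes fault tolerance practical in the real system.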
Spark versus Hadoop
For many years, Hadoop was the leading open source Big Data framework, but recently the newer and more advanced Spark has become more popular. However, the two do not perform exactly the same tasks, and they are not mutually exclusive: they can work together. Spark is reported to run up to 100 times faster than Hadoop's MapReduce in certain circumstances, but it does not provide its own distributed storage system. For this reason, many Big Data analysis projects install Spark on top of Hadoop, so that Spark's advanced analytics applications can use data stored in the Hadoop Distributed File System (HDFS).
Spark keeps its working data in memory, whereas Hadoop's MapReduce keeps it on disk. Spark loads data from the distributed physical storage into much faster RAM and performs its operations there. MapReduce, on the other hand, writes its intermediate results to, and reads them back from, slow, clunky mechanical hard drives.
Spark’s functionality for handling advanced data processing tasks such as real-time stream processing and machine learning is way ahead of what is possible with Hadoop alone. Real-time processing means that data can be fed into an analytical application the moment it is captured, with insights fed back to the user almost immediately through a dashboard so that action can be taken.
In addition to simple map and reduce operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Users can combine all these capabilities in a single workflow. Spark also lets users write applications quickly in Java, Scala or Python, so they can build and run parallel applications in a programming language they already know.
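For readers unfamiliar with the "map and reduce" operations mentioned here, the following is a rough single-machine sketch of the pattern in plain Python, using the classic word count. It is not Spark or Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework would run each phase in parallel across many nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: fold each key's values into a single count.
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["Spark and Hadoop", "Hadoop and HDFS"])
```

Here `counts` maps each lower-cased word to the number of times it appears across both lines.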
Spark struggles when the intermediate data is larger than the memory available on a node: it can spill to disk, but it then gives up much of its speed advantage. For resiliency against node failure, it relies on recomputation (sometimes loosely described as journaling): Spark records the operations that produced each dataset and replays them to rebuild any data that is lost. As a result, recovery from a node failure is similar to Hadoop's, except that the recovery process can be much faster.
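The recomputation idea can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not how Spark is implemented: the class `LineageDataset` and its `lose_cache` method are hypothetical, the in-memory cache stands in for data held on a worker node, and the `source` list stands in for durable storage such as HDFS.

```python
class LineageDataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # durable input, only set on the root dataset
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # operation applied to each parent item
        self._cache = None     # in-memory result, may be lost

    def map(self, fn):
        # Record the operation instead of running it right away.
        return LineageDataset(parent=self, fn=fn)

    def compute(self):
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                # Replay the recorded operation on the (re)computed parent.
                self._cache = [self._fn(x) for x in self._parent.compute()]
        return self._cache

    def lose_cache(self):
        # Simulate a node failure that wipes the in-memory copy.
        self._cache = None


base = LineageDataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
plus_one = squared.map(lambda x: x + 1)

first = plus_one.compute()
squared.lose_cache()
plus_one.lose_cache()
recovered = plus_one.compute()  # rebuilt by replaying the recorded steps
```

Nothing was checkpointed between the two calls to `compute`; the lost results are rebuilt purely from the durable source and the recorded operations.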
Who wins?
Current industry trends favor in-memory techniques such as Apache Spark. Other Big Data analytics tools, although efficient, lack its speed. So to conclude, the choice between Hadoop and Spark depends on the use case.
#Bigdata #BringitOn
Founder and CEO at EURASIAN Gaming
8 years ago: LevelDB wins.
Data-Driven Decisions / Data Governance / Process Improvement / Complex Systems Integration
8 years ago: It's not only the wrong title, it's the wrong point of view for the whole article. It would be better to discuss the cases in which Spark+Hadoop is better than other Hadoop configurations. There is no universal configuration that will fit every project.
Data Aficionado
8 years ago: Spark and MapReduce run in the Hadoop YARN framework on top of HDFS. It's surprising how many articles out there try to compare Hadoop and Spark. In fact, map and reduce is really an algorithm that's implemented in Apache MapReduce.
Data Quality Analyst at Veraltis Asset Management - Part of B2Holding
8 years ago: Apples and pears
Business Focus | Executive MBA
8 years ago: Spark wins