Hadoop versus Spark: Who’s winning?

Naveen Joshi

AI, Robotics & Smart Cities Expert | 600K+ Followers

Published: April 14, 2016

A quick analysis of the hottest competition to hit the analytics world. Has Spark begun to overthrow Hadoop? Or will Hadoop hold the ground it has fortified over 10 years? Let’s find out.

Image source: www.youtube.com

Apache Spark

Apache Spark is a general-purpose Big Data analytics engine that runs on distributed computing clusters such as Hadoop. It performs computations in memory, which speeds up data processing compared with MapReduce. It can run on top of existing Hadoop clusters and read data from the Hadoop Distributed File System (HDFS). It can process large quantities of data using both batch and streaming methods, a combination popularly referred to as the lambda architecture. It gives programmers an application programming interface centered on a data structure called the resilient distributed dataset (RDD): a read-only multiset of data items, distributed over a cluster of machines and maintained in a fault-tolerant way.
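To make the RDD idea concrete, here is a minimal sketch in plain Python. It is not the Spark API: the ToyRDD class and its methods are hypothetical names invented for illustration, and everything runs on one machine. It only shows the two properties described above: the partitioned data is read-only, and each transformation returns a new dataset instead of changing the old one.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, partitions, steps=()):
        # Tuples keep the source partitions read-only.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Recorded transformations, applied only when an action runs.
        self._steps = tuple(steps)

    def map(self, fn):
        # A transformation returns a new dataset; the original is untouched.
        return ToyRDD(self._partitions, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._steps + (("filter", pred),))

    def _compute(self, partition):
        items = list(partition)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        # An action triggers evaluation, one partition at a time.
        out = []
        for partition in self._partitions:
            out.extend(self._compute(partition))
        return out


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()
original = numbers.collect()
```

In a real cluster each partition would live on a different machine and be processed in parallel; the sketch processes them in a loop purely to stay self-contained.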

Spark versus Hadoop

For many years Hadoop was the leading open source Big Data framework, but recently the newer and more advanced Spark has become more popular. However, the two do not perform exactly the same tasks, and they are not mutually exclusive: they can work together. Spark is reported to run up to 100 times faster than Hadoop MapReduce in certain circumstances, but it does not provide its own distributed storage system. For this reason, many Big Data analysis projects install Spark on top of Hadoop, so that Spark’s advanced analytics applications can use data stored in the Hadoop Distributed File System (HDFS).

Spark keeps its working data in memory, whereas Hadoop’s MapReduce keeps it on disk. Spark loads data from distributed physical storage into much faster RAM and operates on it there. MapReduce, on the other hand, writes to and reads from slower disk storage between processing steps.
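The difference matters most for jobs that pass over the same data several times. The following sketch, in plain Python, counts how often a simulated storage layer is read under each approach. CountingStore and both functions are hypothetical names made up for this illustration; no real disk or cluster is involved.

```python
class CountingStore:
    """Hypothetical stand-in for disk-backed storage that counts reads."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)


def iterate_from_disk(store, passes):
    # MapReduce-style: every pass reloads its input from storage.
    total = 0
    for _ in range(passes):
        total += sum(store.read())
    return total


def iterate_from_memory(store, passes):
    # Spark-style: load once, keep the working set in memory, reuse it.
    cached = store.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


disk_store = CountingStore(range(10))
memory_store = CountingStore(range(10))
disk_total = iterate_from_disk(disk_store, passes=5)
memory_total = iterate_from_memory(memory_store, passes=5)
```

Both functions compute the same total, but the first touches storage on every pass while the second touches it once. That is the intuition behind the speed-ups reported for iterative workloads.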

Spark’s functionality for advanced data processing tasks such as real-time stream processing and machine learning is well ahead of what is possible with Hadoop alone. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights are immediately fed back to the user through a dashboard so that action can be taken.

In addition to simple map and reduce operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Users can combine all these capabilities seamlessly in a single workflow. It also lets users write applications quickly in Java, Scala or Python, so they can build and run parallel applications in a language they already know.
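For readers unfamiliar with the "simple map and reduce operations" mentioned above, here is a word count written as explicit map, shuffle and reduce steps in plain Python. It is a single-machine sketch of the pattern, not Spark or Hadoop code, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    # Emit one (word, 1) pair per word.
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    # Group all values that share a key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Combine each key's values into a single count.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


lines = ["Spark runs on Hadoop", "Hadoop stores data", "Spark processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In a distributed framework the map and reduce functions run on many machines at once and the shuffle moves data between them; the logic per record is the same as in this sketch.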

Spark’s performance suffers when the intermediate data is larger than the memory available on a node, because it then has to spill to disk. For resiliency it relies on recomputation: if a node fails, the lost data is rebuilt by re-running the operations that produced it (the RDD’s lineage). Recovery from a node failure therefore resembles recovery in Hadoop, except that it can be much faster.
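Recovery by recomputation can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark’s actual mechanism: the source dictionary stands in for durable input such as files on HDFS, the cache dictionary stands in for results held in memory across nodes, and all names are hypothetical.

```python
def apply_steps(partition, steps):
    # Re-run the recorded operations, in order, on one partition.
    items = list(partition)
    for fn in steps:
        items = [fn(x) for x in items]
    return items


# Durable input, split into three partitions (assumed to survive failures).
source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
# The recorded operations that produced the in-memory results.
steps = [lambda x: x + 1, lambda x: x * 10]

# Results held in memory, one entry per partition.
cache = {pid: apply_steps(part, steps) for pid, part in source.items()}

# Simulate a node failure that loses partition 1.
del cache[1]

# Rebuild only what is missing; surviving partitions are left alone.
recomputed = []
for pid, part in source.items():
    if pid not in cache:
        cache[pid] = apply_steps(part, steps)
        recomputed.append(pid)
```

Only the lost partition is recomputed, which is why this style of recovery can be quick when the recorded chain of operations is short.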

Who wins?

Current industry trends favor in-memory techniques such as Apache Spark, which is receiving positive feedback. Other Big Data analytics tools, although efficient, lack its speed in some respects. So, to conclude, the choice between Hadoop and Spark depends on the use case.

#Bigdata #BringitOn

Saverio Castellano

Founder and CEO at EURASIAN Gaming

8 years ago

LevelDB wins.

Elena Makurochkina 'Mark'

Data-Driven Decisions / Data Governance / Process Improvement / Complex Systems Integration

8 years ago

It's not only the wrong title, it's the wrong point of view for the whole article. It's better to discuss in which cases Spark+Hadoop is better than other Hadoop configurations. There is no universal configuration that fits every project.

Stephen Moon

Data Aficionado

8 years ago

Spark and MapReduce run in the Hadoop YARN framework on top of HDFS. It's surprising how many articles out there try to compare Hadoop and Spark. In fact, map and reduce is really an algorithm that's implemented in Apache MapReduce.

Andrej Cigoj

Data Quality Analyst at Veraltis Asset Management - Part of B2Holding

8 years ago

Apples and pears

Baldo Taberner Aguas

Business Focus | Executive MBA

8 years ago

Spark wins
