
Apache Spark vs. Hadoop MapReduce

Malini Shukla

Senior Data Scientist || Hiring || 6M+ impressions || Trainer || Top Data Scientist || Speaker || Top content creator on LinkedIn || Tech Evangelist

Published: April 18, 2018

Comparison between Apache Spark and Hadoop MapReduce

Apache Spark is an open-source, lightning-fast big data framework designed to speed up computation. Hadoop MapReduce reads from and writes to disk between processing steps, which slows computation down. Spark can run on top of Hadoop and offers a faster alternative for the same workloads.

Introduction

Apache Spark – An open-source big data framework that provides a faster, more general-purpose data processing engine. Spark is designed for fast computation and covers a wide range of workloads, including batch, interactive, iterative, and streaming.
Hadoop MapReduce – Also an open-source framework for writing applications. It processes structured and unstructured data stored in HDFS and is designed to handle large volumes of data on a cluster of commodity hardware. MapReduce processes data in batch mode only.

Speed

Apache Spark – Spark is a lightning-fast cluster computing tool. By the project's own figures, it runs applications up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce. It achieves this by reducing the number of read/write cycles to disk and keeping intermediate data in memory.
Hadoop MapReduce – MapReduce reads from and writes to disk at every step, which slows processing down.
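The difference in disk traffic can be sketched with a toy model in plain Python. This is not Spark or Hadoop code: the function names, the JSON files, and the three arithmetic stages are assumptions made up for illustration. One runner writes every intermediate result to disk and reads it back, the other keeps everything in memory.

```python
import json
import os
import tempfile


def run_with_disk(data, stages, workdir):
    """Chain stages the MapReduce way: persist each stage's output, then reload it."""
    disk_round_trips = 0
    current = data
    for i, stage in enumerate(stages):
        current = [stage(x) for x in current]
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(current, f)       # write intermediate result to disk
        with open(path) as f:
            current = json.load(f)      # next stage reads it back from disk
        disk_round_trips += 1
    return current, disk_round_trips


def run_in_memory(data, stages):
    """Chain stages the in-memory way: intermediate results never touch disk."""
    current = data
    for stage in stages:
        current = [stage(x) for x in current]
    return current, 0


# Three hypothetical processing stages applied to a tiny dataset.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = list(range(5))

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_trips = run_with_disk(data, stages, workdir)
mem_result, mem_trips = run_in_memory(data, stages)

print(disk_result == mem_result, disk_trips, mem_trips)
```

Both runners produce the same result; only the number of disk round trips differs, and that is the cost the comparison above refers to.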

Difficulty

Apache Spark – Spark is easy to program because its core abstraction, the RDD (Resilient Distributed Dataset), comes with a large set of high-level operators.
Hadoop MapReduce – In MapReduce, developers have to hand-code every operation, which makes it harder to work with.
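To give a feel for what hand-coding each operation involves, here is a word count written as explicit map, shuffle, and reduce phases in plain Python. It is a single-process sketch, not the Hadoop API: the function names and sample lines are assumptions, and a real job would also need job configuration and input/output formats.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["spark and hadoop", "hadoop mapreduce", "spark streaming"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

Every phase is spelled out by hand here, which is the kind of effort that high-level operators are meant to remove.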

Easy to Manage

Apache Spark – Spark can run batch, interactive, machine learning, and streaming workloads on the same cluster, which makes it a complete data analytics engine. There is no need to manage a separate component for each need; installing Spark on a cluster is enough to handle all of these requirements.
Hadoop MapReduce – MapReduce provides only a batch engine, so other requirements depend on separate engines such as Storm, Giraph, and Impala. Managing that many components is difficult.

Real-time analysis

Apache Spark – Spark can process real-time data, i.e. data arriving from live event streams at rates of millions of events per second, such as Twitter data or Facebook shares and posts. Processing live streams efficiently is one of Spark’s strengths.
Hadoop MapReduce – MapReduce falls short for real-time data processing because it was designed for batch processing of very large volumes of data.


Latency

Apache Spark – Spark provides low-latency computing.
Hadoop MapReduce – MapReduce is a high-latency computing framework.

Interactive mode

Apache Spark – Spark can process data interactively.
Hadoop MapReduce – MapReduce doesn’t have an interactive mode.

Streaming

Apache Spark – Spark can process real-time data through Spark Streaming.
Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
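Spark Streaming handles a live stream as a sequence of small batches cut at a fixed time interval. The idea can be sketched in plain Python as follows; this is a toy model rather than the Spark Streaming API, and the function name, the one-second interval, and the sample events are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, value) events into consecutive fixed-length time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval_seconds)].append(value)
    return [batches[window] for window in sorted(batches)]


# Hypothetical event stream: (seconds since start, keyword seen).
events = [(0.2, "spark"), (0.7, "hadoop"), (1.1, "spark"), (1.9, "spark"), (2.5, "hadoop")]

running_counts = Counter()
snapshots = []
for batch in micro_batches(events, interval_seconds=1.0):
    running_counts.update(batch)            # process one small batch at a time
    snapshots.append(dict(running_counts))  # result available after every interval

print(snapshots)
```

A result is available after every interval instead of only once the whole dataset has been processed, which is the contrast with pure batch mode.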

Ease of use

Apache Spark – Spark is easier to use because its abstraction (the RDD) lets users process data with high-level operators. It also provides rich APIs in Java, Scala, Python, and R.
Hadoop MapReduce – MapReduce is more complex: developers work with low-level APIs to process data, which requires a lot of hand coding.
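The appeal of high-level operators is easiest to see in a chained pipeline. The sketch below uses `MiniRDD`, a hypothetical in-memory stand-in written for this example; it is not the real Spark class, though its method names are loosely modeled on the RDD operators `flatMap`, `map`, and `reduceByKey`.

```python
class MiniRDD:
    """Hypothetical single-machine stand-in for an RDD; for illustration only."""

    def __init__(self, items):
        self._items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the resulting sequences.
        return MiniRDD(y for x in self._items for y in fn(x))

    def map(self, fn):
        return MiniRDD(fn(x) for x in self._items)

    def reduce_by_key(self, fn):
        # Combine the values of (key, value) pairs that share a key.
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return MiniRDD(merged.items())

    def collect(self):
        return list(self._items)


lines = ["spark is fast", "spark is easy"]
counts = dict(
    MiniRDD(lines)
    .flat_map(str.split)                  # lines -> words
    .map(lambda word: (word, 1))          # words -> (word, 1) pairs
    .reduce_by_key(lambda a, b: a + b)    # sum the 1s per word
    .collect()
)
print(counts)
```

The whole word count fits in one chained expression, because grouping and combining are handled by the operators rather than written out by the user.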

Recovery

Apache Spark – RDDs recover partitions lost on failed nodes by recomputing them from their lineage (the DAG of transformations). Spark also supports checkpointing, a recovery style closer to Hadoop’s, which shortens an RDD’s chain of dependencies.
Hadoop MapReduce – MapReduce is naturally resilient to system faults and failures, which makes it a highly fault-tolerant system.
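Recovery by recomputation can be illustrated with a small plain-Python model. `LineageDataset` is a hypothetical class written for this sketch, not part of Spark, and Spark's actual mechanism is far more involved. Each derived dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
class LineageDataset:
    """Hypothetical toy dataset that remembers how each partition was derived."""

    def __init__(self, num_partitions, source=None, parent=None, fn=None):
        self.num_partitions = num_partitions
        self.source = source    # stable input partitions (think: files in HDFS)
        self.parent = parent    # dataset this one was derived from
        self.fn = fn            # transformation applied to the parent's rows
        self.cache = {}         # partitions currently held in memory
        self.computations = 0   # how many partitions have been (re)computed

    def map(self, fn):
        return LineageDataset(self.num_partitions, parent=self, fn=fn)

    def partition(self, index):
        if self.parent is None:
            return self.source[index]
        if index not in self.cache:
            # Partition is missing: rebuild it from the parent via the lineage.
            rows = self.parent.partition(index)
            self.cache[index] = [self.fn(row) for row in rows]
            self.computations += 1
        return self.cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that drops one in-memory partition."""
        self.cache.pop(index, None)


source = LineageDataset(2, source=[[1, 2], [3, 4]])
result = source.map(lambda x: x * x).map(lambda x: x + 1)

before = [result.partition(i) for i in range(2)]
result.lose_partition(1)            # "failure": partition 1 disappears
recovered = result.partition(1)     # rebuilt from lineage, partition 0 untouched
print(before, recovered, result.computations)
```

Only the lost partition is recomputed; the surviving partition and the rest of the job are left alone.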

Scheduler

Apache Spark – Because it computes in memory and plans each job as a DAG of stages, Spark acts as its own flow scheduler.
Hadoop MapReduce – MapReduce needs an external job scheduler, such as Oozie, to schedule complex flows.
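Whether the scheduler is built in or external, scheduling a flow comes down to running steps in dependency order. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) is shown below; the stage names and dependencies are hypothetical, and neither Spark nor Oozie is implemented this way.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each stage maps to the set of stages it depends on.
stage_dependencies = {
    "load_logs": set(),
    "load_users": set(),
    "parse_logs": {"load_logs"},
    "join": {"parse_logs", "load_users"},
    "aggregate": {"join"},
}

# A valid execution order runs every stage after all of its dependencies.
execution_order = list(TopologicalSorter(stage_dependencies).static_order())
print(execution_order)
```

Independent stages such as the two loads may appear in either order, which is also where a scheduler finds room to run work in parallel.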

Fault tolerance

Apache Spark – Spark is fault-tolerant, so there is no need to restart an application from scratch after a failure.
Hadoop MapReduce – Like Spark, MapReduce is fault-tolerant, so there is no need to restart an application from scratch after a failure.

Security

Apache Spark – Spark is a little less secure than MapReduce because its built-in authentication relies only on a shared secret (password). When Spark runs on YARN with HDFS, it can also rely on Hadoop’s Kerberos authentication and HDFS permissions.
Hadoop MapReduce – Hadoop MapReduce is more secure because it supports Kerberos authentication as well as Access Control Lists (ACLs), which extend the traditional file permission model.


Cost

Apache Spark – Spark needs a lot of RAM to run in memory, which increases the size of the cluster and therefore its cost.
Hadoop MapReduce – MapReduce is the cheaper option in terms of cost.

Language Developed

Apache Spark – Spark is developed in Scala.
Hadoop MapReduce – Hadoop MapReduce is developed in Java.

Category

Apache Spark – It is a data analytics engine, which makes it a natural choice for data scientists.
Hadoop MapReduce – It is a basic data processing engine.


Chandra Shekhar Singh

Software Engineer || Ex TCS || Ex HCL || IIMA (EPABA)

6 years ago

Spark is fast, and its DataFrame/SQL syntax makes you productive quickly, since it is easier to adopt than the MapReduce framework for data processing?

Artur Neivas

Analista Infraestrutura TI Jr | Bradesco

6 years ago

Gustavo Mantoan, take a look

Valerio Morfino

Public Sector Industry Managing Partner at DXC Technology

6 years ago

Nice and easy-to-read comparison between Spark and MR

Parag Gedam

AI Leadership Enthusiast, Data Driven Innovation and Cloud Excellence for future growth Data Engineering Leader( Azure, Google Cloud, HDFS, Hive, Sqoop, HBase, Scala, Spark, Airflow)

6 years ago

Very useful!!
