Befriending Apache Hadoop and Spark
If you've ever been part of a discussion involving Big Data, then Apache Spark and Hadoop, two of the most popular projects of the Apache Software Foundation, have almost certainly come up in the same breath. So, let's take a look at both of them in a nutshell.
1) Doing things differently: Even though they both sit under the umbrella of Big Data frameworks, they don't exactly serve the same purpose. Hadoop, at large scale, brings distributed data infrastructure: it spreads massive data across multiple nodes of a cluster built on commodity hardware. Apache Spark, on the other side, does not provide data storage capabilities; it focuses purely on improving the data-processing part.
Key takeaway: Hadoop brings HDFS (storage) plus processing; Spark brings processing, which may or may not sit on top of HDFS.
2) Use either one or both, up to you: Hadoop couples HDFS (Hadoop Distributed File System) with the MapReduce model for data crunching, whereas Apache Spark does not come with its own file system. So, when using Hadoop you do not strictly need Apache Spark's processing capabilities. Conversely, since Spark has no file system of its own, it can operate on HDFS as its data source, as sketched below.
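Here is a minimal PySpark sketch of that pairing: Spark does the processing while HDFS does the storage. The NameNode address (namenode:9000) and the file paths are hypothetical placeholders, not anything prescribed by either project.

```python
from pyspark.sql import SparkSession

# Spark supplies the processing engine; storage comes from HDFS.
spark = SparkSession.builder.appName("spark-on-hdfs").getOrCreate()

# Read a text file straight out of HDFS (hypothetical NameNode address and path).
lines = spark.read.text("hdfs://namenode:9000/data/events.log")

# Crunch the data in Spark, then write the result back into HDFS.
errors = lines.filter(lines.value.contains("ERROR"))
errors.write.mode("overwrite").text("hdfs://namenode:9000/data/errors")

spark.stop()
```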
3) Apache Spark, the horse, and Apache Hadoop, the elephant: As you can rightly guess, Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. A typical Hadoop MapReduce job looks like this: read the data from the cluster, perform some operation, and write the results back to the cluster; then start again from point zero, read the updated data, perform another operation, and write the results again (batch processing). A typical Apache Spark job reads the data into memory once (its in-memory feature), processes it, and writes the full result out at the end. This in-memory computation is what gives Spark its dramatic edge in speed, as the sketch below illustrates.
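A minimal sketch of why in-memory computation matters: once the data is cached, several operations can reuse it without another disk round-trip, which is exactly the repeated read-process-write cycle MapReduce would incur. The input path is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input; in MapReduce, every pass would re-read this from disk.
logs = spark.read.text("hdfs://namenode:9000/data/events.log")

# Pull the dataset into memory once...
logs.cache()

# ...then run several operations against it without touching disk again.
error_count = logs.filter(logs.value.contains("ERROR")).count()
warn_count = logs.filter(logs.value.contains("WARN")).count()
total = logs.count()

print(error_count, warn_count, total)
spark.stop()
```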
4) Do you really need the 100x faster Apache Spark?: Ask yourself on the basis of your use case. If your requirements suit batch processing, choose Hadoop; if your requirements demand analysing streaming data, then Apache Spark is your call. Typical use cases for Spark are things like real-time advertisement tracking, cyber security, and online recommendations; a small streaming sketch follows below.
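To make the streaming case concrete, here is a minimal Spark Structured Streaming sketch, assuming a text stream arriving on a local socket (localhost:9999, hypothetical; you could feed it with `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a live text stream from a socket (hypothetical host/port).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Maintain a running word count -- continuous processing that batch MapReduce isn't built for.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console every time the stream delivers new data.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```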
5) Fault recovery: which one bounces back faster?
In Apache Hadoop, data is written to disk after every operation, so it is considered highly resilient, and recovery is further helped by the replication factor in HDFS. Spark, even though it works in memory, provides similar fault recovery by virtue of Resilient Distributed Datasets (RDDs): each RDD keeps track of the lineage of operations that produced it, so lost partitions can be recomputed on another node (see the sketch below).
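A minimal sketch of that lineage idea: toDebugString() prints the chain of transformations Spark records for an RDD, which is what it replays to rebuild partitions lost when a node fails. The RDD itself is just illustrative data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD through a chain of transformations (illustrative data, 8 partitions).
numbers = sc.parallelize(range(1, 1001), 8)
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() shows the lineage Spark keeps for the RDD; if an executor
# holding some partitions dies, Spark replays this chain to rebuild them.
print(evens.toDebugString().decode("utf-8"))
print(evens.count())

spark.stop()
```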