How to Get the Most Out of Spark?
Not sure whether to choose Spark or Hadoop? Let's take a quick tour of Spark.
Spark is an open-source, in-memory computation engine used for both batch processing and stream processing (batch processing means processing data that has already been stored, while stream processing means processing streams of data in real time). Compared with Hadoop MapReduce, Spark can be up to 100 times faster because of its in-memory computation. The basic unit of Spark is the RDD (Resilient Distributed Dataset).
Why Spark?
For batch processing, Hadoop MapReduce can be used; for stream processing, Apache Storm is available; for interactive processing, we can use Apache Impala; and for graph processing, Neo4j exists. Spark came into existence because there was no single powerful engine that could handle both batch and real-time processing. Spark offers interactive, in-memory, stream, and graph processing through its components (Spark SQL, Spark Streaming, SparkR, Spark MLlib, and Spark GraphX).
Let's discuss the fundamental unit of Spark, i.e., the RDD.
RESILIENT DISTRIBUTED DATASETS (RDD)
The RDD is the key abstraction of Spark. RDDs are immutable and are distributed across the nodes of a cluster. Decomposing the name:
· Resilient: fault-tolerant, because partitions damaged by node failures can be recomputed.
· Distributed: the data resides on multiple nodes.
· Dataset: the data being worked with.
RDDs are logically partitioned so that they can be computed in parallel on different nodes of the cluster. They can also recompute themselves on failure, which is what makes them fault-tolerant.
There are three ways by which RDDs can be created (a sketch follows the list):
· PARALLELIZED COLLECTIONS: using the parallelize method (sc.parallelize)
· EXTERNAL DATASETS: using the textFile method (sc.textFile)
· EXISTING RDDs: by applying transformation operations to existing RDDs.
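Here is a minimal PySpark sketch of all three creation paths, assuming a local Spark installation; the file name data.txt is a hypothetical placeholder:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-creation-demo")

# 1. Parallelized collection: distribute an in-memory Python list
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. External dataset: read a text file as an RDD of lines
# ("data.txt" is a placeholder; point it at a real file)
lines = sc.textFile("data.txt")

# 3. Existing RDD: a transformation derives a new RDD from an old one
squares = numbers.map(lambda x: x * x)

print(squares.collect())  # [1, 4, 9, 16, 25]
sc.stop()
```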
RDD OPERATIONS
Two types of operations can be performed on RDDs:
· Transformations
· Actions
o TRANSFORMATIONS: functions that take an input RDD and produce one or more new RDDs. The input RDD is not changed, because RDDs are immutable; instead, a new RDD is formed by the computation. A few examples are map(), filter(), reduceByKey(), etc.
Transformations always create new RDDs, but they execute only when an action is called; this is why they are said to be lazily evaluated. There are two kinds of transformations:
o NARROW TRANSFORMATIONS: transformations in which each output partition depends on a single input partition, so no data is shuffled. E.g. map(), flatMap(), filter(), etc.
o WIDE TRANSFORMATIONS: transformations in which an output partition may depend on data from many input partitions, so data is shuffled across the cluster. E.g. reduceByKey(), groupByKey(), join(), etc.
o ACTIONS: operations that produce the final result of an RDD computation and send it from the executors to the driver. An action does not create a new RDD; instead, it sets the lazy chain of transformations in motion. E.g. first(), take(), collect(), reduce(), count(), etc. (See the sketch below.)
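A minimal PySpark sketch of these ideas, again assuming a local Spark installation (the word list is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-operations-demo")

words = sc.parallelize(["spark", "hadoop", "spark", "storm", "spark"])

# Narrow transformation: map() works inside each partition, no shuffle
pairs = words.map(lambda w: (w, 1))

# Wide transformation: reduceByKey() shuffles the data so that all
# values for a given key end up in the same partition
counts = pairs.reduceByKey(lambda a, b: a + b)

# Nothing has run yet -- transformations are lazy.
# collect() is an action: it triggers execution and ships the result
# from the executors to the driver.
print(counts.collect())  # e.g. [('spark', 3), ('hadoop', 1), ('storm', 1)]
sc.stop()
```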
RDD PERSISTENCE AND CACHING
Persistence is an optimization technique in Spark that lets us store intermediate results so they can be reused whenever required. There are two methods for this: cache() and persist().
NEED FOR PERSISTENCE
The result of an operation, or an RDD itself, may be required at many points in a job. If we recompute each RDD every time it is needed, we waste both time and memory. To avoid this, Spark provides persistence.
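A minimal sketch of both methods, assuming a local Spark installation:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")

numbers = sc.parallelize(range(1, 1000001))
evens = numbers.filter(lambda x: x % 2 == 0)

# cache() keeps the RDD in memory; it is shorthand for
# persist(StorageLevel.MEMORY_ONLY)
evens.cache()

# persist() lets you choose a storage level instead, e.g. spilling to
# disk when the RDD does not fit in memory:
# evens.persist(StorageLevel.MEMORY_AND_DISK)

# The first action computes the RDD and materializes the cache...
print(evens.count())  # 500000
# ...later actions reuse the cached partitions instead of recomputing
print(evens.sum())
sc.stop()
```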
BENEFITS OF PERSISTENCE
o Time-efficient: cached results do not need to be recomputed.
o Cost-efficient: less computation means fewer resources consumed.
o Reduces overall execution time.