How to Get the Most Out of Spark?
Not sure whether to choose Spark or Hadoop? Let's take a quick tour of Spark.
Spark is an open-source, in-memory computation engine used for both batch processing and stream processing (batch processing means processing data that has already been stored, while stream processing means processing streams of data in real time). Compared with Hadoop MapReduce, Spark can be up to 100 times faster because of its in-memory computation. The basic unit of Spark is the RDD (Resilient Distributed Dataset).
Why Spark?
For batch processing, Hadoop MapReduce can be used; for stream processing, Apache Storm is available; for interactive processing, we can use Apache Impala; and for graph processing, Neo4j exists. Spark came into existence because there was no single powerful engine that could handle both batch and real-time processing. Spark offers interactive, in-memory, stream, and graph processing through its components (Spark SQL, Spark Streaming, SparkR, Spark MLlib, and Spark GraphX).
Let's discuss the fundamental unit of Spark, i.e., the RDD.
RESILIENT DISTRIBUTED DATASETS (RDD)
The RDD is the key abstraction of Spark. RDDs are immutable and are distributed across the nodes of a cluster. Decomposing the name:
· Resilient: fault-tolerant, because partitions damaged by node failures can be recomputed.
· Distributed: the data resides on multiple nodes.
· Dataset: the data being worked with.
RDDs are logically partitioned so that they can be computed in parallel on different nodes of the cluster. They can also recompute themselves on failure, which is what makes them fault-tolerant.
There are three ways by which RDDs can be created (a sketch follows the list):
· PARALLELIZED COLLECTIONS: using the parallelize method (sc.parallelize)
· EXTERNAL DATASETS: using the textFile method (sc.textFile)
· EXISTING RDDs: by applying transformation operations to existing RDDs.
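Here is a minimal PySpark sketch of all three creation paths, assuming a local Spark installation; the file name data.txt is a hypothetical placeholder:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-creation-demo")

# 1. Parallelized collection: distribute an in-memory Python list
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. External dataset: read a text file as an RDD of lines
# ("data.txt" is a placeholder; point it at a real file)
lines = sc.textFile("data.txt")

# 3. Existing RDD: a transformation derives a new RDD from an old one
squares = numbers.map(lambda x: x * x)

print(squares.collect())  # [1, 4, 9, 16, 25]
sc.stop()
```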
RDD OPERATIONS
Two types of operations can be performed on RDDs:
· Transformations
· Actions
o TRANSFORMATIONS: functions that take an input RDD and produce one or more new RDDs. The input RDD is not changed, because RDDs are immutable; instead, a new RDD is formed by the computation. A few examples are map(), filter(), reduceByKey(), etc.
Transformations always create new RDDs, but they execute only when an action is called; this is why they are said to be lazily evaluated. There are two kinds of transformations:
o NARROW TRANSFORMATIONS: transformations in which each output partition depends on a single input partition, so no data is shuffled. E.g. map(), flatMap(), filter(), etc.
o WIDE TRANSFORMATIONS: transformations in which an output partition may depend on data from many input partitions, so data is shuffled across the cluster. E.g. reduceByKey(), groupByKey(), join(), etc.
o ACTIONS: operations that produce the final result of an RDD computation and send it from the executors to the driver. An action does not create a new RDD; instead, it sets the lazy chain of transformations in motion. E.g. first(), take(), collect(), reduce(), count(), etc. (See the sketch below.)
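A minimal PySpark sketch of these ideas, again assuming a local Spark installation (the word list is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-operations-demo")

words = sc.parallelize(["spark", "hadoop", "spark", "storm", "spark"])

# Narrow transformation: map() works inside each partition, no shuffle
pairs = words.map(lambda w: (w, 1))

# Wide transformation: reduceByKey() shuffles the data so that all
# values for a given key end up in the same partition
counts = pairs.reduceByKey(lambda a, b: a + b)

# Nothing has run yet -- transformations are lazy.
# collect() is an action: it triggers execution and ships the result
# from the executors to the driver.
print(counts.collect())  # e.g. [('spark', 3), ('hadoop', 1), ('storm', 1)]
sc.stop()
```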
RDD PERSISTENCE AND CACHING
Persistence is an optimization technique in Spark that lets us store intermediate results so they can be reused whenever required. There are two methods for this: cache() and persist().
NEED FOR PERSISTENCE
The result of an operation, or an RDD itself, may be required at many points in a job. If we recompute each RDD every time it is needed, we waste both time and memory. To avoid this, Spark provides persistence.
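A minimal sketch of both methods, assuming a local Spark installation:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")

numbers = sc.parallelize(range(1, 1000001))
evens = numbers.filter(lambda x: x % 2 == 0)

# cache() keeps the RDD in memory; it is shorthand for
# persist(StorageLevel.MEMORY_ONLY)
evens.cache()

# persist() lets you choose a storage level instead, e.g. spilling to
# disk when the RDD does not fit in memory:
# evens.persist(StorageLevel.MEMORY_AND_DISK)

# The first action computes the RDD and materializes the cache...
print(evens.count())  # 500000
# ...later actions reuse the cached partitions instead of recomputing
print(evens.sum())
sc.stop()
```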
BENEFITS OF PERSISTENCE
o Time-efficient: cached results do not need to be recomputed.
o Cost-efficient: less computation means fewer resources consumed.
o Reduces overall execution time.