Spark Internals
Learning through a question-and-answer format has been ingrained in us since childhood, as it promotes active engagement and curiosity about a subject. Despite using Apache Spark regularly, many users do not have a thorough understanding of how Spark works internally. I am going to cover Spark internals in a comprehensive way:
Q). What is Apache Spark?
Ans: Apache Spark is a general-purpose in-memory computing engine. Spark is a plug-and-play computing engine that needs two things to work with (see the sketch after this list):
1. Storage – local storage, HDFS, Amazon S3.
2. Resource Manager – YARN, Mesos, Kubernetes.
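For illustration, here is a minimal PySpark sketch (the master URL, app name, and input path are assumptions, not from this article) showing where those two pluggable pieces appear in code: the master URL selects the resource manager, and the input path selects the storage layer.

from pyspark.sql import SparkSession

# Resource manager: "local[*]" runs on local threads; "yarn" or a "k8s://..."
# master URL would hand the job to YARN or Kubernetes instead.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-internals-demo")   # hypothetical app name
    .getOrCreate()
)

# Storage: the path decides the storage layer; swap "file://" for
# "hdfs://..." or "s3a://..." as needed. The path below is made up.
df = spark.read.text("file:///tmp/input.txt")
print(df.count())

spark.stop()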
[ Note: Spark is a replacement for MapReduce.]
Q). What is the problem with MapReduce?
Ans: Every MapReduce job requires two disk accesses: one to read its input and one to write its output. This is the bottleneck with MapReduce. In an iterative algorithm, every iteration has to read from and write back to HDFS, so the job keeps paying that disk cost.
[ MapReduce works very slowly, and the reason is that a lot of disk I/O is required.]
Q). How does Spark resolve the MapReduce problem?
Ans: Spark reads the input from HDFS and stores the intermediate outputs in memory only. When all processing is done, it writes the final output to HDFS. [Spark solves the MapReduce problem using the “in-memory” concept.]
[ But in the case of MapReduce, intermediate outputs are also stored in HDFS. Because of this, it needs more disk I/O, and more disk I/O operations mean more computation time.]
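As a rough sketch of why this matters, the toy PySpark loop below (the data and the computation are made up) caches its dataset once and then runs several passes over it; each pass reads the cached, in-memory copy instead of going back to HDFS the way a chain of MapReduce jobs would.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

# cache() asks Spark to keep this RDD in memory after it is first computed.
data = sc.parallelize(range(1, 1001)).cache()

total = data.sum()                      # first action materializes and caches the data
for _ in range(5):                      # later passes reuse the in-memory copy,
    total += data.filter(lambda x: x % 2 == 0).count()   # no re-read from storage

print(total)
spark.stop()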
Q). How exactly is the data stored in Spark?
Ans: The basic unit that holds data in Spark is called an “RDD”. RDD stands for ‘Resilient Distributed Dataset’.
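As a small illustrative sketch (running in local mode, with made-up data), an RDD is a collection whose elements are split into partitions that can be processed on different nodes of the cluster:

from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")

# parallelize() distributes a local collection into an RDD with 3 partitions.
rdd = sc.parallelize(["a", "b", "c", "d", "e", "f"], numSlices=3)
print(rdd.getNumPartitions())   # 3 – the data is held as partitions
print(rdd.collect())            # gathers the distributed data back to the driver

sc.stop()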
Meaning of Resilient:
Resilient means that it is resilient to failures. If we lose an RDD (or a partition of it), Spark can recover it by recomputing it from its lineage.
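One way to see this resilience in practice (again a sketch, with an assumed input path): Spark records the lineage of each RDD, i.e. the chain of transformations that produced it, and replays that chain to rebuild a lost partition instead of relying on replicated copies.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

base = sc.textFile("file:///tmp/input.txt")     # hypothetical input path
words = base.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() shows the recorded lineage Spark would replay after a failure.
print(counts.toDebugString().decode("utf-8"))

sc.stop()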