Apache Spark (Big Data) Cache - Something Nice to Know
Abhishek Choudhary
Data Infrastructure Engineering in RWE/RWD | Healthtech DhanvantriAI
Spark caching is one of the most important aspects of Spark's in-memory computing model.
Caching an RDD is useful when the RDD branches out into several downstream computations, or when it is reused multiple times. Caching is also loosely described as "breaking the lineage", since consumers read the cached data instead of recomputing the whole chain.
Remember that Spark caching is lazy: simply creating an RDD and calling rdd.cache() will do nothing until an action runs.
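A minimal sketch of that laziness, assuming an active SparkContext named sc:

val rdd = sc.parallelize(1 to 1000000)
rdd.cache()                  // only marks the RDD for caching; nothing is computed or stored yet
println(rdd.getStorageLevel) // reports MEMORY_ONLY even though the cache is still empty
rdd.count()                  // the first action materializes the RDD and actually fills the cache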
What happens if your parent RDD dies, but you assumed the child RDD was already cached?
val parentRDD = ...                // some existing RDD, kept as a placeholder
val childRDD = parentRDD.map(...)  // derive the child RDD
childRDD.cache()                   // mark childRDD for caching (lazy, nothing stored yet)
childRDD.first()                   // action: computes just one partition, so only part of childRDD lands in the cache
parentRDD.unpersist()              // the parent is released, but childRDD is only partially cached
// running the program in Spark 2.1.0 Cluster Mode
The Storage tab of the Spark UI shows that only 50% of the RDD is cached.
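You can also check this without the UI; a rough sketch using SparkContext's developer API (again assuming sc is in scope):

sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} of ${info.numPartitions} partitions cached")
}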
Fine, let's try counting the childRDD.
// if childRDD were fully cached, this should finish within ~35 ms
childRDD.count()
For the above sample, I was expecting an execution time of about 30 ms, but it took more than that (60 ms). Exploring the DAG further, I realised it was doing more work than I expected.
Caching means serializing data and storing it in the cluster's shared executor memory, so reading it back has a deserialization cost associated with it.
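The storage level controls how much of that cost you pay. cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); a sketch of the trade-off:

import org.apache.spark.storage.StorageLevel

// deserialized Java objects: fastest reads, largest memory footprint
childRDD.persist(StorageLevel.MEMORY_ONLY)

// alternative (for an RDD that is not already persisted): serialized bytes,
// more compact in memory, but every read pays the deserialization cost
// childRDD.persist(StorageLevel.MEMORY_ONLY_SER)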
To confirm, I re-ran the count operation, and this time it actually finished within the expected time. Why?
Because the first count operation read through the entire RDD, fully caching it as a side effect; when I ran the count again, it was much faster because there was no need to recreate the RDD.
Whether the cache ends up partial or full depends on the first action called on the RDD right after cache(): first() touches a single partition, while count() touches them all.
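In short:

childRDD.cache()
childRDD.first() // touches a single partition, so only that partition is cached
childRDD.count() // scans every partition, so the cache becomes complete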
Notice the difference in Executor Computing Time (the green bar) compared with the previous run: the computing time is now much lower than it was with the partially cached RDD.
A few more important points about Spark caching:
- Spark caching is never mandatory; it mostly pays off for iterative computations, or when recreating an RDD is expensive.
- Spark releases cached memory with an LRU eviction policy, but it is better to call unpersist() manually if you are sure about the RDD's lifecycle (see the sketch after this list).
- Caching adds overhead to the JVM (memory pressure and garbage collection), which is extremely important for a Spark application, so always be deliberate about what you cache.
- Tuning Spark's memory fractions (spark.memory.fraction and spark.memory.storageFraction) changes the behavior of caching and data spill, so read up and understand the concept properly before changing them (see the config sketch after this list).
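On manual release, a minimal sketch (expensiveRDD is a hypothetical placeholder):

val cached = expensiveRDD.cache() // hypothetical expensive-to-recreate RDD
cached.count()                    // materialize the cache with a full scan
// ... reuse `cached` across several jobs ...
cached.unpersist()                // free the memory deterministically instead of waiting for LRU eviction

And the memory fractions are set on the SparkConf before the context starts; the values below are just the Spark 2.x defaults, shown for illustration rather than as a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cache-tuning-sketch")          // hypothetical application name
  .set("spark.memory.fraction", "0.6")        // share of (heap - 300MB) split between execution and storage
  .set("spark.memory.storageFraction", "0.5") // part of that pool shielded from eviction by execution
val sc = new SparkContext(conf)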