Apache Spark (Big Data) Cache - Something Nice to Know

Spark caching is one of the most important aspects of its in-memory computing technology.

Spark RDD caching is needed when an RDD branches out or when an RDD is used multiple times. Caching is also described as breaking the lineage, since downstream jobs read the cached blocks instead of recomputing the chain of transformations.

Remember Spark caching is lazy, so simply creating an RDD and calling rdd.cache() will do nothing.
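A minimal sketch of that laziness (assuming an active SparkContext sc, as in spark-shell):

val rdd = sc.parallelize(1 to 1000000).map(_ * 2) // any small RDD (illustrative)
rdd.cache() // only marks the RDD for caching; nothing is computed or stored yet
rdd.count() // the first action materializes the RDD and actually populates the cache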

What happens if your parent RDD dies but you assumed the child RDD was already cached?

// running the program on a Spark 2.1.0 cluster
val ParentRDD = sc.parallelize(1 to 1000000, 2).cache() // assume the parent itself is cached; two partitions (illustrative)
val childRDD = ParentRDD.map(_ * 2) // any transformation (illustrative)
childRDD.cache() // lazy: only marks childRDD for caching, nothing is stored yet
childRDD.first() // execute Action: materializes (and caches) only the first partition
ParentRDD.unpersist() // the parent "dies": its cached blocks are dropped

The Storage tab of the Spark UI says the cache happened only 50% (one of the two partitions).


Fine, let's try counting the childRDD:

// if childRDD were fully cached, this should finish within about 35 ms
childRDD.count()

For the above sample I was expecting an execution time of about 30 ms, but it took longer (more than 60 ms). Exploring the DAG further, I realised Spark was doing more work than I expected: the partitions that never made it into the cache had to be recomputed from the full lineage, since the parent was gone.

Caching can involve serialization, storing the data in executor memory across the cluster, so reading it back carries a deserialization cost. The default cache() / MEMORY_ONLY level keeps deserialized objects; serialized levels such as MEMORY_ONLY_SER trade CPU for a smaller memory footprint.
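A minimal sketch of that trade-off, assuming an existing RDD named rdd:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY) // what cache() does: deserialized objects, fastest to read
rdd.unpersist() // Spark does not allow changing the level of a persisted RDD in place
rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: smaller footprint, deserialization cost on every read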

To confirm, I re-ran the count operation, and this time it actually finished within the expected time. Why?

Because the first count read through the entire RDD, causing the whole of it to be cached; the second count ran much faster because there was no need to recreate the RDD.
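A rough way to observe the two runs from the shell (a sketch; the time helper below is hypothetical, not a Spark API):

// hypothetical helper: wall-clock time of an action in milliseconds
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"took ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

time(childRDD.count()) // first run: recomputes and caches the missing partitions
time(childRDD.count()) // second run: served entirely from cache, much faster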

Whether the cache ends up partial or full depends on the first action called on the RDD after cache(): an action like first() materializes only one partition, while count() scans and caches them all.

Notice the difference in Executor Computing Time (the green bar) compared with the previous image: the computing time is now much lower than it was under the partial cache.

A few more important points about Spark caching:

  • Spark caching is not always necessary; it fits best with iterative workloads, or when recreating an RDD is expensive.
  • Spark evicts cached blocks with an LRU policy, but it is better to call unpersist() manually if you are sure about the RDD's lifecycle.
  • Caching adds pressure on the JVM heap (and hence on garbage collection), which is critical for a Spark application, so always be deliberate about what you cache.
  • Tuning the memory fractions spark.memory.fraction and spark.memory.storageFraction changes caching and spill behaviour, so read and understand the concepts properly before changing them (see the sketch after this list).
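A minimal sketch of setting both fractions at session creation (the values shown are the Spark 2.x defaults; adjust only after understanding unified memory management):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-tuning")
  .config("spark.memory.fraction", "0.6") // share of (heap - 300MB) used for execution + storage
  .config("spark.memory.storageFraction", "0.5") // part of that protected from eviction by execution
  .getOrCreate()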
