Memory Overhead
Arabinda Mohapatra
LWD-17th Jan 2025 || Data Engineer @ Wells Fargo || PySpark, Alteryx, AWS, Stored Procedures, Hadoop, Python, SQL, Airflow, Kafka, Iceberg, Delta Lake, Hive, BFSI, Telecom
In Spark, memory overhead refers to the additional memory allocated beyond the user-defined executor memory (spark.executor.memory). This overhead covers internal operations that run outside the executor heap and is crucial for the smooth execution of tasks.
What is Memory Overhead?
Memory overhead in Spark includes memory used for:
1. Shuffle operations: off-heap buffers used when shuffling data between stages.
2. Broadcast variables: copies of broadcast data held on every executor.
3. Internal data structures: Spark's own bookkeeping, such as task metadata, storage tracking, and job details, requires additional memory.
4. Network buffers: buffers for transferring data between executors and the driver.
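Components like these are sized through Spark's overhead settings rather than the executor heap. A sketch of the relevant spark-defaults.conf entries (values are illustrative, not recommendations; spark.executor.memoryOverheadFactor requires Spark 3.3+):

```properties
# Fixed overhead per executor; overrides the default of max(10% of heap, 384m)
spark.executor.memoryOverhead       512m
# Or tune the fraction instead of a fixed size (Spark 3.3+)
spark.executor.memoryOverheadFactor 0.10
```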
How Much Memory is Allocated?
The amount of memory allocated for overhead is a fraction of the total executor memory. By default, Spark allocates 10% of the executor memory, with a minimum of 384 MB. This can be overridden with the spark.executor.memoryOverhead parameter (the older YARN-specific name, spark.yarn.executor.memoryOverhead, is deprecated). For example:
from pyspark import SparkConf, SparkContext

# 4 GB heap per executor, plus 512 MB of overhead outside the heap
conf = (
    SparkConf()
    .setAppName("MemoryOverheadExample")
    .set("spark.executor.memory", "4g")
    .set("spark.executor.memoryOverhead", "512m")
)
sc = SparkContext(conf=conf)
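As a sanity check, the default overhead Spark would compute for this executor can be sketched in plain Python. The 10% factor and 384 MB floor match Spark's documented defaults; default_overhead_mb is a hypothetical helper, not a Spark API:

```python
# Sketch of Spark's default overhead rule: max(10% of executor memory, 384 MiB).
def default_overhead_mb(executor_memory_mb: int,
                        factor: float = 0.10,
                        floor_mb: int = 384) -> int:
    return max(int(executor_memory_mb * factor), floor_mb)

executor_mb = 4 * 1024                         # spark.executor.memory = 4g
overhead_mb = default_overhead_mb(executor_mb) # 409 MiB if left at the default

# With the explicit 512m override, the resource manager (e.g. YARN)
# is asked for heap + overhead per executor container:
container_mb = executor_mb + 512
print(overhead_mb, container_mb)  # 409 4608
```

Note that the container requested from the cluster manager is the sum of the heap and the overhead, so the 4g/512m configuration above actually reserves 4.5 GB per executor.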