Apache Spark - Memory Management

Spark's memory management for executor nodes involves several key concepts and components that work together to ensure efficient execution of tasks.

Executor memory is mainly divided into two parts: 1. Overhead (off-heap) memory and 2. JVM Heap memory.

Let's focus on the JVM Heap for now.

Components of Executor Memory

  • JVM Heap Memory:

  1. Execution Memory: Used for intermediate data computation, such as shuffles, joins, sorts, and aggregations.
  2. Storage Memory: Used for caching RDDs and DataFrames and for broadcast variables.


  • Reserved Memory: A fixed amount of memory (300 MB per executor) set aside for running the Spark engine itself.
  • User Memory: Memory used for user-defined objects and variables. This is not managed by Spark but by the user's application code.

  • The ratio of Spark memory to User memory is 60/40 of (Total heap memory − Reserved memory), controlled by spark.memory.fraction (default 0.6).
  • Spark memory is used for DataFrame caching and operations, while User memory is used for:

  1. User-defined data structures
  2. Spark internal metadata (information about data, execution plans, logical and physical plans, catalogs, and the state of various components)
  3. UDFs, RDD conversions, and lineage and dependency tracking.



This Spark memory is further divided into a Storage Memory Pool and an Execution Memory Pool, in a 50/50 ratio by default (spark.memory.storageFraction = 0.5).

The storage pool is used for caching DataFrames, and the execution pool is used to buffer intermediate data.
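The split described above can be sketched numerically. Here is a minimal sketch in plain Python, assuming the default 300 MB reserve and the default fractions; it models the arithmetic only, not Spark's actual allocator:

```python
def executor_memory_layout(executor_mem_mb,
                           reserved_mb=300,        # fixed reserve for the Spark engine
                           memory_fraction=0.6,    # spark.memory.fraction (default 0.6)
                           storage_fraction=0.5):  # spark.memory.storageFraction (default 0.5)
    """Approximate the on-heap memory regions of a Spark executor."""
    usable = executor_mem_mb - reserved_mb
    spark_mem = usable * memory_fraction            # execution + storage pools
    user_mem = usable - spark_mem                   # user data structures, UDFs, metadata
    storage_pool = spark_mem * storage_fraction     # cache region
    execution_pool = spark_mem - storage_pool       # shuffle/join/sort buffers
    return {
        "reserved": reserved_mb,
        "user_memory": user_mem,
        "storage_pool": storage_pool,
        "execution_pool": execution_pool,
    }

layout = executor_memory_layout(4096)  # a 4 GB executor heap
```

For a 4 GB heap this gives roughly 2.2 GB of Spark memory, split evenly between the two pools, and about 1.5 GB of user memory.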

Suppose we are performing a join on two DataFrames. You will need to buffer these DataFrames, and that buffering happens in the execution memory pool. Execution memory is short-lived and is flushed as soon as its task is done.

If you cache DataFrames, they are kept in the storage pool until you uncache them or the executor shuts down. So the storage memory pool is long-lived.


Now, let's say we have a 4-core executor. The 4 cores are effectively 4 slots or 4 threads, where your tasks run in parallel within the same JVM.

Initially, under static memory management, the execution memory was divided equally among all the slots available in a JVM.

But with the advent of unified memory management, the execution memory is divided only among ACTIVE tasks, on demand (according to the amount of memory each task requests). If the entire execution memory is consumed, the memory manager can borrow additional memory from the storage pool.
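The per-task arbitration follows a simple rule: with N active tasks, each task is guaranteed at least 1/(2N) of the execution pool and never granted more than 1/N. A minimal sketch of those bounds in plain Python (an illustration, not Spark's code):

```python
def per_task_bounds(execution_pool_mb, active_tasks):
    """With N active tasks, the unified manager guarantees each task at
    least 1/(2N) of the execution pool and caps it at 1/N."""
    n = active_tasks
    return execution_pool_mb / (2 * n), execution_pool_mb / n

lo, hi = per_task_bounds(1024, 4)  # 4 active tasks share a 1024 MB pool
```

A task that cannot obtain its 1/(2N) minimum blocks until other tasks release memory, which is why slots are filled on demand rather than pre-partitioned.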


But there is a boundary between storage and execution memory (set by spark.memory.storageFraction) below which cached data MUST not be evicted. Execution can evict cached blocks only above this boundary; if a task still cannot get enough memory after evicting everything evictable, Spark spills to disk or, where spilling is not possible, you encounter an OOM. Storage, in turn, can borrow free execution memory but can never forcibly evict memory that running tasks hold.
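This arbitration can be modeled as a toy function (an illustration, not Spark's real MemoryManager): an execution request is served first from free execution memory, then by evicting cached blocks above the protected boundary, and fails once the boundary is reached.

```python
def grant_execution_request(request_mb, execution_free_mb,
                            storage_used_mb, protected_storage_mb):
    """Toy model of unified-memory arbitration (illustration only).

    Returns (granted_mb, evicted_from_storage_mb); raises MemoryError
    where real Spark would spill to disk or fail with an OOM.
    """
    if request_mb <= execution_free_mb:
        return request_mb, 0.0
    shortfall = request_mb - execution_free_mb
    # Only cached blocks above the protected boundary may be evicted.
    evictable = max(0.0, storage_used_mb - protected_storage_mb)
    if shortfall <= evictable:
        return request_mb, shortfall
    raise MemoryError("request exceeds free + evictable memory")

# A 400 MB request with 300 MB free: 100 MB of unprotected cache is evicted.
granted, evicted = grant_execution_request(400, 300, 500, 250)
```

Real Spark tracks this at block granularity and usually spills rather than failing outright, but the boundary rule is the same: the protected storage region is never sacrificed to execution.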

(More than 5 cores per executor causes excessive memory-management contention, so it is better to have at most 5 cores.)

For large memory requirements in your Spark applications, you may mix on-heap and off-heap memory by enabling spark.memory.offHeap.enabled (and setting spark.memory.offHeap.size). This can help reduce GC pauses.

This off-heap memory is used to add extra space to the Spark memory (the execution and storage pools).

Key Parameters

  1. spark.executor.memory: Sets the memory allocated to each executor.
  2. spark.executor.memoryOverhead: Allocates additional off-heap memory per executor.
  3. spark.driver.memory: Sets the memory allocated to the driver.
  4. spark.driver.memoryOverhead: Allocates additional off-heap memory to the driver.
  5. spark.memory.offHeap.enabled: Enables or disables off-heap memory.
  6. spark.memory.offHeap.size: Sets the size of off-heap memory.
  7. spark.memory.fraction: Determines the fraction of heap space (after the reserved memory) used for execution and storage (default 0.6).
  8. spark.memory.storageFraction: Determines the fraction of memory protected for storage within spark.memory.fraction (default 0.5).
  9. GC Options: Configure JVM GC options to optimize performance.
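For reference, the parameters above could be combined in a single spark-submit invocation like the following; every size here is purely illustrative (and my_app.py is a placeholder), not a recommendation:

```shell
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=512m \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  my_app.py
```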
