Apache Spark - Memory Management
Kumar Preeti Lata
Spark's memory management for executor nodes involves several key concepts and components that work together to ensure efficient execution of tasks.
An executor's memory is mainly divided into two parts: 1. overhead (off-heap) memory and 2. the JVM heap.
Let's focus on the JVM heap for now.
Components of Executor Memory
Within the JVM heap, the unified Spark memory region is further divided into a Storage Memory Pool and an Execution Memory Pool, split 50/50 by default (controlled by spark.memory.storageFraction, default 0.5).
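As a rough sketch of how the pools are sized, assuming Spark's default settings (roughly 300 MiB reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5), the split for an example 10 GiB executor heap works out like this:

```python
# Back-of-the-envelope estimate of Spark's unified memory pools.
# The constants below are Spark defaults (assumptions - check your own config):
#   reserved memory              : ~300 MiB (fixed)
#   spark.memory.fraction        : 0.6
#   spark.memory.storageFraction : 0.5

def unified_pools(heap_mib, memory_fraction=0.6, storage_fraction=0.5, reserved_mib=300):
    usable = heap_mib - reserved_mib        # heap left after reserved memory
    spark_mem = usable * memory_fraction    # unified storage + execution region
    user_mem = usable - spark_mem           # user memory (UDF objects, data structures)
    storage = spark_mem * storage_fraction  # storage pool (50% of spark memory by default)
    execution = spark_mem - storage         # execution pool gets the rest
    return {"spark": spark_mem, "user": user_mem,
            "storage": storage, "execution": execution}

pools = unified_pools(10 * 1024)            # e.g. spark.executor.memory = 10g
```

With these defaults the storage and execution pools come out equal, which is exactly the 50/50 split described above.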
The storage pool is used for caching DataFrames, while the execution pool buffers intermediate data for operations such as joins, shuffles, sorts, and aggregations.
Suppose we are performing a join on two DataFrames. The join needs to buffer rows from both sides, and that buffering happens in the execution memory pool. Execution memory is short-lived: it is released as soon as the task that claimed it finishes.
If you cache a DataFrame, it is held in the storage pool and stays there until you uncache it or the executor shuts down. So the storage memory pool is long-lived.
Now, let's say we have a 4-core executor. The 4 cores are essentially 4 slots, or 4 threads, in which your tasks run in parallel within the same JVM.
Initially, under static memory management, the execution memory was divided equally among all the slots available in a JVM.
But with the advent of unified memory management, execution memory is divided only among ACTIVE tasks, and on demand (each slot receives the amount of memory it actually asks for). If the entire execution pool is consumed, the memory manager can borrow unused memory from the storage pool.
However, a boundary must be defined between the storage and execution pools so that the storage pool does not evict data that it MUST keep. Execution can reclaim storage memory only down to this SET boundary; if a task demands more than the free memory plus the evictable portion of storage, you encounter an OOM situation. The storage pool, for its part, can only borrow execution memory that is currently free; it can never force running tasks to give theirs up.
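This borrowing asymmetry can be captured in a toy model (my own simplification for illustration, not Spark's actual classes): execution may evict cached blocks down to a protected floor, while storage may only use memory that is free.

```python
# Toy model of unified memory management (an illustrative simplification).
# "protected_storage" plays the role of the spark.memory.storageFraction floor:
# execution can evict cached blocks, but never below this floor, and storage
# can never evict execution memory.

class UnifiedMemory:
    def __init__(self, total_mb, storage_fraction=0.5):
        self.total = total_mb
        self.protected_storage = total_mb * storage_fraction  # eviction floor
        self.storage_used = 0.0
        self.execution_used = 0.0

    def acquire_execution(self, amount):
        free = self.total - self.storage_used - self.execution_used
        if amount <= free:
            self.execution_used += amount
            return True
        # Not enough free memory: evict cached blocks, but only above the floor.
        evictable = max(0.0, self.storage_used - self.protected_storage)
        needed = amount - free
        if needed <= evictable:
            self.storage_used -= needed      # cached blocks are dropped/spilled
            self.execution_used += amount
            return True
        return False                         # would spill to disk or OOM in real Spark

    def acquire_storage(self, amount):
        free = self.total - self.storage_used - self.execution_used
        if amount <= free:                   # storage may only take FREE memory;
            self.storage_used += amount      # it can never evict execution memory
            return True
        return False
```

For example, with a 100 MB region, a job can cache 80 MB, and a later 40 MB execution request succeeds by evicting 20 MB of cache down toward the 50 MB floor; a subsequent storage request for more memory then fails because it cannot touch execution memory.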
(More than 5 cores per executor causes excessive memory-management contention, so it is better to have at most 5 cores per executor.)
For large memory requirements in your Spark applications, you can mix on-heap and off-heap memory by enabling spark.memory.offHeap.enabled and setting spark.memory.offHeap.size. Off-heap memory is not managed by the JVM garbage collector, so this can help reduce GC delays.
This off-heap memory adds extra space on top of the on-heap SPARK MEMORY region.
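A minimal configuration sketch for enabling this (the 2g size is an arbitrary example value, to be tuned for your workload; note the off-heap size must also fit within the container's overhead allowance on YARN/Kubernetes):

```
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     2g
```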
Key Parameters