Apache Spark - Memory Management
Kumar Preeti Lata
Spark's memory management for executor nodes involves several key concepts and components that work together to ensure efficient execution of tasks.
An executor's memory is mainly divided into two parts: 1. overhead (off-heap) memory and 2. the JVM heap.
Let's focus on the JVM heap for now.
Components of Executor Memory
Within the JVM heap, the unified Spark memory region is further divided into a Storage Memory Pool and an Execution Memory Pool, split 50/50 by default (controlled by spark.memory.storageFraction, default 0.5).
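As a rough sketch of how the pools are sized, assuming Spark's default settings (roughly 300 MiB reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5), the split for an example 10 GiB executor heap works out like this:

```python
# Back-of-the-envelope estimate of Spark's unified memory pools.
# The constants below are Spark defaults (assumptions - check your own config):
#   reserved memory              : ~300 MiB (fixed)
#   spark.memory.fraction        : 0.6
#   spark.memory.storageFraction : 0.5

def unified_pools(heap_mib, memory_fraction=0.6, storage_fraction=0.5, reserved_mib=300):
    usable = heap_mib - reserved_mib        # heap left after reserved memory
    spark_mem = usable * memory_fraction    # unified storage + execution region
    user_mem = usable - spark_mem           # user memory (UDF objects, data structures)
    storage = spark_mem * storage_fraction  # storage pool (50% of spark memory by default)
    execution = spark_mem - storage         # execution pool gets the rest
    return {"spark": spark_mem, "user": user_mem,
            "storage": storage, "execution": execution}

pools = unified_pools(10 * 1024)            # e.g. spark.executor.memory = 10g
```

With these defaults the storage and execution pools come out equal, which is exactly the 50/50 split described above.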
The storage pool is used for caching DataFrames, while the execution pool buffers intermediate data for operations such as joins, shuffles, sorts, and aggregations.
Suppose we are performing a join on two DataFrames. The join needs to buffer rows from both sides, and that buffering happens in the execution memory pool. Execution memory is short-lived: it is released as soon as the task that claimed it finishes.
If you cache a DataFrame, it is held in the storage pool and stays there until you uncache it or the executor shuts down. So the storage memory pool is long-lived.
Now, let's say we have a 4-core executor. The 4 cores are essentially 4 slots, or 4 threads, in which your tasks run in parallel within the same JVM.
Initially, under static memory management, the execution memory was divided equally among all the slots available in a JVM.
But with the advent of unified memory management, execution memory is divided only among ACTIVE tasks, and on demand (each slot receives the amount of memory it actually asks for). If the entire execution pool is consumed, the memory manager can borrow unused memory from the storage pool.
However, a boundary must be defined between the storage and execution pools so that the storage pool does not evict data that it MUST keep. Execution can reclaim storage memory only down to this SET boundary; if a task demands more than the free memory plus the evictable portion of storage, you encounter an OOM situation. The storage pool, for its part, can only borrow execution memory that is currently free; it can never force running tasks to give theirs up.
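This borrowing asymmetry can be captured in a toy model (my own simplification for illustration, not Spark's actual classes): execution may evict cached blocks down to a protected floor, while storage may only use memory that is free.

```python
# Toy model of unified memory management (an illustrative simplification).
# "protected_storage" plays the role of the spark.memory.storageFraction floor:
# execution can evict cached blocks, but never below this floor, and storage
# can never evict execution memory.

class UnifiedMemory:
    def __init__(self, total_mb, storage_fraction=0.5):
        self.total = total_mb
        self.protected_storage = total_mb * storage_fraction  # eviction floor
        self.storage_used = 0.0
        self.execution_used = 0.0

    def acquire_execution(self, amount):
        free = self.total - self.storage_used - self.execution_used
        if amount <= free:
            self.execution_used += amount
            return True
        # Not enough free memory: evict cached blocks, but only above the floor.
        evictable = max(0.0, self.storage_used - self.protected_storage)
        needed = amount - free
        if needed <= evictable:
            self.storage_used -= needed      # cached blocks are dropped/spilled
            self.execution_used += amount
            return True
        return False                         # would spill to disk or OOM in real Spark

    def acquire_storage(self, amount):
        free = self.total - self.storage_used - self.execution_used
        if amount <= free:                   # storage may only take FREE memory;
            self.storage_used += amount      # it can never evict execution memory
            return True
        return False
```

For example, with a 100 MB region, a job can cache 80 MB, and a later 40 MB execution request succeeds by evicting 20 MB of cache down toward the 50 MB floor; a subsequent storage request for more memory then fails because it cannot touch execution memory.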
(More than 5 cores per executor causes excessive memory-management contention, so it is better to have at most 5 cores per executor.)
For large memory requirements in your Spark applications, you can mix on-heap and off-heap memory by enabling spark.memory.offHeap.enabled and setting spark.memory.offHeap.size. Off-heap memory is not managed by the JVM garbage collector, so this can help reduce GC delays.
This off-heap memory adds extra space on top of the on-heap SPARK MEMORY region.
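A minimal configuration sketch for enabling this (the 2g size is an arbitrary example value, to be tuned for your workload; note the off-heap size must also fit within the container's overhead allowance on YARN/Kubernetes):

```
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     2g
```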
Key Parameters