Spark on YARN Cluster: Some Observations

1.     Number of partitions in Spark

     Basic         => number of partitions = total number of cores across all executors (one task per core)

     Good        => 2-3 times the total number of cores.

   So the number of partitions should normally be kept proportional to the number of cores, as in the sketch below.
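
As a rough sketch (the cluster size here is my assumption, not a recommendation), with 4 executors of 5 cores each you would start from 20 partitions and target 40-60:

    // Hypothetical sizing: 4 executors x 5 cores = 20 cores in total
    // Basic => 20 partitions (one task per core); Good => 2-3x => 40-60
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()
    val df    = spark.read.parquet("/data/events")        // hypothetical input path
    val tuned = df.repartition(60)                        // ~3x the total core count
    spark.conf.set("spark.sql.shuffle.partitions", "60")  // shuffle stages get 60 partitions too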


2.     Spark dynamic allocation performs very well on YARN and ensures that memory is not over-utilized, but this can become a problem if you want to juice everything out of the cluster and control resources yourself.
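
As a sketch, these are the settings involved (the executor bounds are illustrative); note that dynamic allocation on YARN needs the external shuffle service:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // required on YARN for dynamic allocation
      .set("spark.dynamicAllocation.minExecutors", "2")  // illustrative bounds
      .set("spark.dynamicAllocation.maxExecutors", "20")

    // To juice everything out yourself instead, disable it and pin the executor count:
    // conf.set("spark.dynamicAllocation.enabled", "false")
    //     .set("spark.executor.instances", "10")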


3.     If memory utilization is high but cores are free

If you have 100 GB of YARN memory and the minimum allocation is 1 GB, then YARN ideally has 100 slots. If your minimum allocation is 4 GB, then YARN has 25 slots. Each slot represents a unit of capacity YARN can use to hold and process data.

Decrease the YARN container size (yarn.scheduler.minimum-allocation-mb) to free up more slots, and make sure your CPU utilization stays below 80%.
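
The slot arithmetic above as a quick sketch:

    // 100 GB of YARN memory divided by the minimum container allocation
    val totalYarnMemoryMb = 100 * 1024              // 100 GB
    val slotsAt1g = totalYarnMemoryMb / 1024        // 1 GB minimum allocation -> 100 slots
    val slotsAt4g = totalYarnMemoryMb / (4 * 1024)  // 4 GB minimum allocation -> 25 slots
    println(s"slots @ 1g = $slotsAt1g, slots @ 4g = $slotsAt4g")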


4.     Spark Memory

We will calculate based on 10 GB of executor memory.

Spark memory is divided into the following parts:

a)     Spark Safety Region

The Spark safety region is meant to prevent OOM; ideally it is 90% of the total allocation (spark.storage.safetyFraction).

- so 9 GB is the ideally usable Spark memory

b)     Shuffle & Storage

spark.storage.memoryFraction = 60% = 5.4 GB

spark.shuffle.memoryFraction = 20% = 1.8 GB

So for no caching, simply set

spark.storage.memoryFraction = 0

spark.shuffle.memoryFraction = 1
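
As a sketch, with the legacy (pre-Spark 1.6) options named above:

    import org.apache.spark.SparkConf

    // No RDD caching: hand the whole usable region to the shuffle
    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0")
      .set("spark.shuffle.memoryFraction", "1")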

c)     Shuffle Safety Fraction

spark.shuffle.safetyFraction = 80%, a safety margin applied on top of the shuffle memory fraction

When we do reduceByKey, the reduce happens locally first, and the shuffle memory is used to store the intermediate results.
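
Putting the fractions above together for the 10 GB executor, a sketch of the arithmetic:

    val executorMemoryGb = 10.0
    val safetyRegion = executorMemoryGb * 0.9  // spark.storage.safetyFraction -> 9.0 GB usable
    val storageMem   = safetyRegion * 0.6      // spark.storage.memoryFraction -> 5.4 GB for caching
    val shuffleMem   = safetyRegion * 0.2      // spark.shuffle.memoryFraction -> 1.8 GB for shuffle
    val shuffleSafe  = shuffleMem * 0.8        // spark.shuffle.safetyFraction margin on shuffle memory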

5.     Spark Executor Memory Allocation

Recommendation for map-friendly jobs such as ETL:

a)     If you have multiple nodes, a good number of cores per executor is 4.

4 cores ~ 400% CPU utilization

5 cores ~ 425-435%

So more than 4 cores may not be that efficient, though this can be further optimized/tuned.
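
A sketch of that sizing as configuration (the executor count and memory are assumptions; size them to your nodes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "4")      // ~400% CPU; a 5th core only adds ~25-35%
      .set("spark.executor.instances", "8")  // assumption: pick per cluster size
      .set("spark.executor.memory", "10g")   // assumption, matching the 10 GB example above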


This entire set of observations is based on my experience and the many challenges I faced during production deployments. If you are working on a single node, it is an entirely different scenario and not a distributed environment, so God bless you :-)




Dev Lakhani

Co-Founder & Cloud Consultant

7y

Spark on YARN is like driving a Mercedes-Benz using cow muck as a fuel source. The number of hours I want back that I spent tuning the damn thing and using that 1980s UI, oh man!

Mathieu Dumoulin

Lead Data Engineer - Coast Pay

7y

In 5 a), if I understand correctly, you recommend 4 executors per node? If so, that goes against more modern Spark documentation, which recommends one executor per node with lots of resources rather than many smaller ones, to avoid the additional work of managing all of them. I'm sorry if I misunderstand!
