Optimize your EMR cluster
The Spark properties below are really important for boosting cluster performance.
Number of cores per executor
Assigning fewer virtual cores per executor results in a larger number of executors, which in turn causes more I/O operations. Based on experience across a large number of workloads, five virtual cores per executor tends to give good results at any cluster size.
Example:
For a 2xlarge machine with 8 vCPUs (for example, an r5.2xlarge), you may opt for 3 cores per executor.
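As a minimal sketch, this is how the setting could be applied in PySpark. The application name is illustrative; on EMR this value is often supplied via spark-submit or the cluster configuration instead:

```python
from pyspark.sql import SparkSession

# Minimal sketch: pin the number of cores per executor as derived above.
spark = (
    SparkSession.builder
    .appName("emr-tuning-example")        # illustrative name
    .config("spark.executor.cores", "3")  # 3 of the 8 vCPUs per executor
    .getOrCreate()
)
```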
Number of executors per instance
One virtual core out of the 8 vCPUs is reserved for Hadoop daemons, leaving 7 vCPUs for processing.
The number of executors per instance:
(total number of virtual cores per instance - 1) / spark.executor.cores
The number of executors per instance = (8 - 1) / 3 = 7/3 ≈ 2
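A small Python helper (the function name is mine, not a Spark API) makes the arithmetic explicit:

```python
def executors_per_instance(vcpus: int, executor_cores: int) -> int:
    """Reserve 1 vCPU for Hadoop daemons, then split the rest into executors."""
    return (vcpus - 1) // executor_cores

print(executors_per_instance(8, 3))  # (8 - 1) // 3 -> 2
```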
Driver and executor memory
The driver and executor memory can be derived as follows.
Leave 1 GB for the Hadoop daemons.
Total executor memory:
(total RAM per instance - 1 GB for Hadoop daemons) / number of executors per instance
Total executor memory = (64 - 1) / 2 = 63/2 ≈ 31 GB
Of this 31 GB, 90% is allocated to executor memory and 10% to the YARN executor memory overhead.
Hence,
spark.executor.memory = total executor memory * 0.90
spark.executor.memory = 31 * 0.9 = 27.9 ≈ 28 GB
spark.executor.memoryOverhead = total executor memory * 0.10
spark.executor.memoryOverhead = 31 * 0.1 = 3.1 ≈ 3 GB
The same values apply to the driver:
spark.executor.memory and spark.driver.memory: 28 GB
spark.executor.memoryOverhead and spark.driver.memoryOverhead: 3 GB
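The split can be sketched as a helper (the function name is illustrative):

```python
def executor_memory_split(ram_gb: int, executors_per_instance: int) -> tuple[int, int]:
    """Apply the 90/10 split above after reserving 1 GB for Hadoop daemons."""
    per_executor = (ram_gb - 1) // executors_per_instance  # (64 - 1) // 2 = 31
    memory_gb = round(per_executor * 0.90)                 # 27.9 -> 28
    overhead_gb = round(per_executor * 0.10)               # 3.1 -> 3
    return memory_gb, overhead_gb

print(executor_memory_split(64, 2))  # (28, 3)
```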
Spark partition size
If we take the partition size to be 200 MB, the number of partitions is determined by two parameters:
1. the size of the input data
2. the partition size
Number of partitions = total input data size / partition size
For example:
Size of each partition: 200 MB
Total size of input: 100 GB
Number of partitions: (100 * 1024) / 200 = 512
Based on the above calculation, we can set the following two properties:
spark.sql.files.maxPartitionBytes = 200000000 (roughly 200 MB, expressed in bytes)
spark.sql.shuffle.partitions = 512
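A sketch of deriving and applying these two settings in PySpark, using the sizes from the example above:

```python
from pyspark.sql import SparkSession

# Derive the partition count from the input size, as in the example above.
input_size_mb = 100 * 1024                           # 100 GB of input
partition_size_mb = 200
num_partitions = input_size_mb // partition_size_mb  # 512

spark = (
    SparkSession.builder
    .config("spark.sql.files.maxPartitionBytes", "200000000")  # ~200 MB in bytes
    .config("spark.sql.shuffle.partitions", str(num_partitions))
    .getOrCreate()
)
```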
Conclusion:
The configuration as per our discussion:
Total input size: 1200 GB
Total number of nodes: 21 (20 core nodes + 1 master)
Partition size: 200 MB
Cores per executor: 3
Number of executors per instance: 2
Total number of executors: 20 * 2 = 40
Executor and driver memory: 28 GB
Executor and driver memory overhead: 3 GB
Number of partitions: (1200 * 1024) / 200 = 6144
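Putting it all together, a consolidated sketch of the final configuration (the app name is illustrative; on EMR these values are often supplied through spark-submit or the cluster configuration API rather than in code):

```python
from pyspark.sql import SparkSession

# Consolidated sketch of the configuration discussed above
# (20 core nodes, 8 vCPUs and 64 GB RAM each).
spark = (
    SparkSession.builder
    .appName("emr-etl")                        # illustrative name
    .config("spark.executor.cores", "3")
    .config("spark.executor.instances", "40")  # 20 nodes x 2 executors
    .config("spark.executor.memory", "28g")
    .config("spark.driver.memory", "28g")
    .config("spark.executor.memoryOverhead", "3g")
    .config("spark.driver.memoryOverhead", "3g")
    .config("spark.sql.files.maxPartitionBytes", "200000000")  # ~200 MB
    .config("spark.sql.shuffle.partitions", "6144")            # (1200 * 1024) / 200
    .getOrCreate()
)
```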
Do let me know which machine type you used and what the total elapsed time for your ETL was. You may reach out to me for more details.