Optimize your EMR cluster

The following Spark properties are important for boosting cluster performance:

  • Number of cores per executor
  • Number of executors per instance
  • Driver and executor memory
  • Spark partition size


Number of cores per executor

Assigning fewer virtual cores per executor leads to a larger number of executors, which in turn causes a larger amount of I/O operations.

Based on historical records across many workloads, five virtual cores per executor tends to give good results for any cluster size.

Example:

For a 5.2xlarge machine with 8 vCPUs, you may opt for 3 cores per executor.
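As a minimal sketch (assuming the session is configured programmatically in PySpark; the app name below is just a placeholder), this choice maps to the spark.executor.cores property:

    from pyspark.sql import SparkSession

    # 3 cores per executor, per the 8-vCPU example above
    spark = (
        SparkSession.builder
        .appName("etl-job")                       # placeholder app name
        .config("spark.executor.cores", "3")
        .getOrCreate()
    )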

Number of executors per instance

One virtual core out of the 8 vCPUs is reserved for Hadoop daemons, so 7 vCPUs are available for processing.

The number of executors per instance:

(total number of virtual cores per instance - 1) / spark.executor.cores

The number of executors per instance = (8 - 1) / 3 = 7 / 3 ≈ 2
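The same arithmetic as a small Python sketch, using the 8-vCPU example from above:

    # Executors per instance = (vCPUs per instance - 1 reserved for Hadoop daemons) / cores per executor
    vcpus_per_instance = 8
    cores_per_executor = 3

    executors_per_instance = (vcpus_per_instance - 1) // cores_per_executor
    print(executors_per_instance)  # 7 // 3 = 2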

Driver and Executor memory

The driver and executor memory can be determined as follows.

Leave 1 GB for the Hadoop daemons.

Total executor memory:

(total RAM per instance - 1 GB for Hadoop daemons) / number of executors per instance

Total executor memory = (64 - 1) / 2 = 63 / 2 ≈ 31 GB

Out of this 31 GB, 90% is allocated to executor memory and 10% to the YARN executor memory overhead.

Hence,

Spark executor memory = total executor memory * 0.90

Spark executor memory = 31 * 0.90 = 27.9 ≈ 28 GB

spark.executor.memoryOverhead = total executor memory * 0.10

spark.executor.memoryOverhead = 31 * 0.10 = 3.1 ≈ 3 GB

These values map to the following properties:

spark.executor.memory and spark.driver.memory = 28 GB

spark.executor.memoryOverhead and spark.driver.memoryOverhead = 3 GB
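Here is the memory split as a short Python sketch, assuming the 64 GB instance from the example, the 1 GB daemon reservation, and the 90/10 split described above:

    ram_per_instance_gb = 64
    executors_per_instance = 2

    # Reserve 1 GB for Hadoop daemons, then divide the rest across executors
    total_executor_memory_gb = (ram_per_instance_gb - 1) // executors_per_instance   # 31

    # 90% goes to the executor/driver heap, 10% to the YARN memory overhead
    executor_memory_gb = round(total_executor_memory_gb * 0.90)   # 28
    memory_overhead_gb = round(total_executor_memory_gb * 0.10)   # 3

    print(executor_memory_gb, memory_overhead_gb)  # 28 3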

Spark partition size

If we consider a partition size of 200 MB, the number of partitions is calculated from the following two parameters:

1. Size of the input data

2. Partition size

Number of partitions = total input data size / partition size

For example:

Size of each partition: 200 MB

Total input size: 100 GB

Number of partitions: (100 * 1024) / 200 = 512

Based on this calculation, we can set the following two properties:

spark.sql.files.maxPartitionBytes = 200000000

spark.sql.shuffle.partitions = 512
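The partition count can be checked with a quick Python calculation (values from the 100 GB example above):

    partition_size_mb = 200
    input_size_gb = 100

    # Number of partitions = total input data size / partition size
    num_partitions = (input_size_gb * 1024) // partition_size_mb
    print(num_partitions)  # 512

The result feeds spark.sql.shuffle.partitions, while spark.sql.files.maxPartitionBytes caps each input partition at roughly 200 MB.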


Conclusion

Configuration summary based on the discussion above:

Total input data size: 1200 GB

Total number of nodes: 21 (20 core nodes + 1 master node)

Partition size: 200 MB

Cores per executor: 3

Number of executors per instance: 2

Total number of executors: 40 (20 core nodes * 2 executors)

Executor and driver memory: 28 GB

Executor and driver memory overhead: 3 GB

Number of partitions: 1,200,000 / 200 = 6000
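Putting the summary values together, here is a minimal PySpark sketch of the resulting session configuration (the app name is a placeholder; on EMR the same keys can equally go into spark-defaults or a spark-submit command):

    from pyspark.sql import SparkSession

    # Values from the summary above: 3 cores, 28 GB heap, 3 GB overhead, 40 executors, 6000 partitions
    spark = (
        SparkSession.builder
        .appName("etl-job")                                         # placeholder app name
        .config("spark.executor.cores", "3")
        .config("spark.executor.instances", "40")                   # assumes dynamic allocation is disabled
        .config("spark.executor.memory", "28g")
        .config("spark.driver.memory", "28g")                       # driver memory is usually set via spark-submit/spark-defaults before the driver JVM starts
        .config("spark.executor.memoryOverhead", "3g")
        .config("spark.driver.memoryOverhead", "3g")
        .config("spark.sql.files.maxPartitionBytes", "200000000")   # ~200 MB per input partition
        .config("spark.sql.shuffle.partitions", "6000")
        .getOrCreate()
    )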

Do let me know which machine you used and what the total elapsed time was for your ETL. You may reach out to me for more details.

