Optimize your EMR cluster
The Spark properties below are really important for boosting cluster performance.
Number of cores per executor
Assigning fewer virtual cores per executor results in a larger number of executors, which in turn causes more I/O operations. Based on experience across a large number of workloads, five virtual cores per executor tends to give good results at any cluster size.
Example:
For a 2xlarge machine with 8 vCPUs (for example, an r5.2xlarge), you may opt for 3 cores per executor.
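As a minimal sketch, this is how the setting could be applied in PySpark. The application name is illustrative; on EMR this value is often supplied via spark-submit or the cluster configuration instead:

```python
from pyspark.sql import SparkSession

# Minimal sketch: pin the number of cores per executor as derived above.
spark = (
    SparkSession.builder
    .appName("emr-tuning-example")        # illustrative name
    .config("spark.executor.cores", "3")  # 3 of the 8 vCPUs per executor
    .getOrCreate()
)
```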
Number of executors per instance
One virtual core out of the 8 vCPUs is reserved for Hadoop daemons, leaving 7 vCPUs for processing.
The number of executors per instance:
(total number of virtual cores per instance - 1) / spark.executor.cores
The number of executors per instance = (8 - 1) / 3 = 7/3 ≈ 2
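A small Python helper (the function name is mine, not a Spark API) makes the arithmetic explicit:

```python
def executors_per_instance(vcpus: int, executor_cores: int) -> int:
    """Reserve 1 vCPU for Hadoop daemons, then split the rest into executors."""
    return (vcpus - 1) // executor_cores

print(executors_per_instance(8, 3))  # (8 - 1) // 3 -> 2
```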
Driver and executor memory
The driver and executor memory can be derived as follows.
Leave 1 GB for the Hadoop daemons.
Total executor memory:
(total RAM per instance - 1 GB for Hadoop daemons) / number of executors per instance
Total executor memory = (64 - 1) / 2 = 63/2 ≈ 31 GB
Of this 31 GB, 90% is allocated to executor memory and 10% to the YARN executor memory overhead.
Hence,
spark.executor.memory = total executor memory * 0.90
spark.executor.memory = 31 * 0.9 = 27.9 ≈ 28 GB
spark.executor.memoryOverhead = total executor memory * 0.10
spark.executor.memoryOverhead = 31 * 0.1 = 3.1 ≈ 3 GB
The same values apply to the driver:
spark.executor.memory and spark.driver.memory: 28 GB
spark.executor.memoryOverhead and spark.driver.memoryOverhead: 3 GB
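The split can be sketched as a helper (the function name is illustrative):

```python
def executor_memory_split(ram_gb: int, executors_per_instance: int) -> tuple[int, int]:
    """Apply the 90/10 split above after reserving 1 GB for Hadoop daemons."""
    per_executor = (ram_gb - 1) // executors_per_instance  # (64 - 1) // 2 = 31
    memory_gb = round(per_executor * 0.90)                 # 27.9 -> 28
    overhead_gb = round(per_executor * 0.10)               # 3.1 -> 3
    return memory_gb, overhead_gb

print(executor_memory_split(64, 2))  # (28, 3)
```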
Spark partition size
If we take the partition size to be 200 MB, the number of partitions is determined by two parameters:
1. the size of the input data
2. the partition size
Number of partitions = total input data size / partition size
For example:
Size of each partition: 200 MB
Total size of input: 100 GB
Number of partitions: (100 * 1024) / 200 = 512
Based on the above calculation, we can set the following two properties:
spark.sql.files.maxPartitionBytes = 200000000 (roughly 200 MB, expressed in bytes)
spark.sql.shuffle.partitions = 512
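A sketch of deriving and applying these two settings in PySpark, using the sizes from the example above:

```python
from pyspark.sql import SparkSession

# Derive the partition count from the input size, as in the example above.
input_size_mb = 100 * 1024                           # 100 GB of input
partition_size_mb = 200
num_partitions = input_size_mb // partition_size_mb  # 512

spark = (
    SparkSession.builder
    .config("spark.sql.files.maxPartitionBytes", "200000000")  # ~200 MB in bytes
    .config("spark.sql.shuffle.partitions", str(num_partitions))
    .getOrCreate()
)
```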
Conclusion:
The configuration as per our discussion:
Total input size: 1200 GB
Total number of nodes: 21 (20 core nodes + 1 master)
Partition size: 200 MB
Cores per executor: 3
Number of executors per instance: 2
Total number of executors: 20 * 2 = 40
Executor and driver memory: 28 GB
Executor and driver memory overhead: 3 GB
Number of partitions: (1200 * 1024) / 200 = 6144
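Putting it all together, a consolidated sketch of the final configuration (the app name is illustrative; on EMR these values are often supplied through spark-submit or the cluster configuration API rather than in code):

```python
from pyspark.sql import SparkSession

# Consolidated sketch of the configuration discussed above
# (20 core nodes, 8 vCPUs and 64 GB RAM each).
spark = (
    SparkSession.builder
    .appName("emr-etl")                        # illustrative name
    .config("spark.executor.cores", "3")
    .config("spark.executor.instances", "40")  # 20 nodes x 2 executors
    .config("spark.executor.memory", "28g")
    .config("spark.driver.memory", "28g")
    .config("spark.executor.memoryOverhead", "3g")
    .config("spark.driver.memoryOverhead", "3g")
    .config("spark.sql.files.maxPartitionBytes", "200000000")  # ~200 MB
    .config("spark.sql.shuffle.partitions", "6144")            # (1200 * 1024) / 200
    .getOrCreate()
)
```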
Do let me know which machine type you used and what the total elapsed time for your ETL was. You may reach out to me for more details.