To Process 25GB of data in Spark.
How many executor CPU cores are required to process 25GB of data?
Reverse Engineering
25GB = 25*1024MB = 25600MB
Number of Partitions = 25600MB/128MB = 200
Number of CPU cores = Number of Partitions = 200
Note: By default spark creates one partition for each block of the file (blocks being 128mb by default in HDFS). But you can also ask for a higher number of partitions by passing a target value.
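The arithmetic above can be sketched in plain Python (simple arithmetic, not a Spark API call; the 128MB partition size is the default assumed throughout this post):

```python
# Sizing sketch: partitions (and hence cores for full parallelism)
# needed for 25GB of data at the default 128MB partition size.
data_size_mb = 25 * 1024              # 25GB expressed in MB
partition_size_mb = 128               # default HDFS block / Spark partition size
num_partitions = data_size_mb // partition_size_mb
num_cores = num_partitions            # one task (one core) per partition
print(num_partitions)                 # 200
```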
How many executors are required to process 25GB of data?
Note: For better job performance in Spark, the commonly cited guideline is to use between 2 and 5 CPU cores per executor.
Avg CPU cores for each executor = 4
Total number of executors = 200/4 = 50
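Continuing the sketch, the executor count follows from dividing the total cores by the cores per executor (4 here, picked from the 2-to-5 guideline):

```python
num_cores = 200                       # from the partition count above
cores_per_executor = 4                # within the recommended 2-5 range
num_executors = num_cores // cores_per_executor
print(num_executors)                  # 50
```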
How much memory is required per executor to process 25GB of data?
Note: Expected memory for each core = minimum 4*(default partition size) = 4*128MB = 512MB
This per-core memory must also be at least 1.5 times Spark's reserved memory of 300MB, i.e. not less than 450MB per core; 512MB satisfies this.
CPU Cores for each executor = 4
Memory for each executor = 4*512MB = 2GB
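The per-executor memory calculation, with the reserved-memory sanity check from the note above folded in as an assertion:

```python
partition_size_mb = 128
memory_per_core_mb = 4 * partition_size_mb    # 512MB per core
# Sanity check: per-core memory should be at least 1.5x Spark's
# reserved memory of 300MB, i.e. at least 450MB.
assert memory_per_core_mb >= 1.5 * 300
cores_per_executor = 4
executor_memory_mb = cores_per_executor * memory_per_core_mb
print(executor_memory_mb)                     # 2048, i.e. 2GB per executor
```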
What is the total memory required to process 25 GB of data?
Total number of executors = 50
Memory for each executor = 2GB
Total memory for all executors = 50*2GB = 100GB
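Putting the three numbers together, they map directly onto the standard spark-submit resource flags (a sketch only; `my_job.py` is a hypothetical application name, and your master, deploy mode, and dynamic-allocation settings will differ per cluster):

```shell
spark-submit \
  --num-executors 50 \
  --executor-cores 4 \
  --executor-memory 2G \
  my_job.py   # hypothetical application script
```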