SPARK - Partitioning

Why It Matters

Performance suffers greatly when a developer does not consider the underlying infrastructure while developing a Spark application. The technical cost includes increased runtime, harder debugging, and sub-optimal utilization of cluster resources, and those inefficiencies translate into higher cost downstream when Spark best practices are ignored. Cost reduction is vital to the business: better cluster utilization reduces idle resources and lowers the number of DPUs consumed.

Partition Size

Questions to answer before deciding the optimal partition count:

  1. How many threads are available?
  2. What is the current partition count of the DataFrame?
  3. What is the approximate partition size of the DataFrame?

Solutions

  • Aim for numPartitions = (threads available * 2), i.e. roughly 2-3 partitions per available executor CPU core.
  • Alternatively, target 50-200 MB per partition, depending on the use case. Research suggests roughly 128 MB per partition allows a Spark job to achieve optimal parallelism.

* For less complex Spark apps (think jobs with only narrow transformations), the default number of partitions usually suffices. Spark creates one partition for each block of the file, 128 MB per partition, if the file is splittable. The following call checks the number of partitions: df.rdd.getNumPartitions(). This number can then be used to estimate the correct number of DPUs in AWS Glue.

Partition on Read

Developers should determine the number of partitions based on the size of the input dataset. Luckily, Spark does this well by default as long as the data is splittable. The result depends on how the data is laid out in storage such as S3 or HDFS and on how Spark splits the data while reading.

For example, when the input data is 50 GB (roughly 50,000 MB) and the block size is the default 128 MB, it is stored in 391 blocks, which means 391 partitions. The input split is set by the Hadoop InputFormat used to read the file and may differ depending on settings and environment.

Remember: 1 partition makes for 1 task that runs on 1 core.
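The estimate above is simple arithmetic, sketched here in plain Python (the input and block sizes are the example's numbers, not fixed constants):

```python
import math

# Example numbers from the text: a 50 GB input, approximated as
# 50,000 MB, read with the default 128 MB block size.
input_size_mb = 50_000
block_size_mb = 128

# One partition per block; the last, partially filled block
# still becomes its own partition, hence the ceiling.
partitions = math.ceil(input_size_mb / block_size_mb)
print(partitions)  # 391 blocks -> 391 partitions -> 391 tasks
```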

Partition on Shuffle

Shuffle partitions are the partitions used when data is shuffled for wide transformations (join, groupBy, aggregation). After a wide transformation, the number of partitions defaults to 200, a value chosen because research suggested it was a good default. In practice, that value is often not appropriate for a developer's use case, and an unsuitable number can hurt performance. The default is changed through the configuration setting spark.sql.shuffle.partitions, which is the most frequently tuned parameter when working with wide transformations.

For small amounts of data, developers should reduce the number of shuffle partitions (see partition size above). Having many partitions with only a few MB of data each results in underutilization and increases the time it takes to shuffle over the network. If too much data is spread across too few partitions, the load on each executor grows and may lead to out-of-memory (OOM) errors. Disk spills also become likely, and spilling to disk is among the slowest things a Spark job can do.

Partition on Write

When writing a DataFrame to disk, developers need to pay attention to partition sizes. Spark produces one file per task when writing and reads at least one file per task when reading. If the cluster that wrote the DataFrame to disk used g1 workers (more memory) and could process large partitions, then a smaller cluster (with less memory) will have problems reading that DataFrame back.

Lastly, writing many small files incurs a storage cost that accumulates over time. repartition() increases the number of partitions and therefore the number of output files; coalesce() reduces the number of partitions and decreases the number of output files. These two methods are another way to manage partitions throughout all stages.
