SPARK - Partitioning
nagaraju juluru
Hadoop | Hive | Sqoop | PySpark| Spark Streaming | Kafka |AWS(S3,EMR,Ec2,Athena,Glue,Dynamo DB and Redshift) | Databricks | HBASE | Cassandra |Snowflake| Airflow|
Why It Matters
Performance suffers when a developer does not consider the underlying infrastructure while building a Spark application. The technical cost includes longer runtimes, harder debugging, and poor utilization of cluster resources. Ignoring Spark best practices also drives up cost downstream. Cost reduction is vital to the business: better cluster utilization reduces idle resources and lowers the number of DPUs consumed.
Partition Size
Questions to answer before deciding the optimal partition count:
Solutions
* For less complex Spark apps (think jobs with only narrow transformations), the default number of partitions is usually sufficient. Spark creates one partition for each block of the file, which is 128 MB per partition (if the file is splittable). The following code checks the number of partitions: df.rdd.getNumPartitions() (see the sketch below). This number can then be used to estimate the correct number of DPUs in AWS Glue.
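As a minimal sketch, assuming a SparkSession you create yourself and an illustrative Parquet path on S3 (both hypothetical), checking the partition count on read might look like this:

```python
from pyspark.sql import SparkSession

# Assumes you create (or already have) a SparkSession; the S3 path is illustrative.
spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Read a splittable format; Spark creates roughly one partition per 128 MB block.
df = spark.read.parquet("s3://my-bucket/events/")

# Number of partitions Spark created on read.
print(df.rdd.getNumPartitions())
```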
Partition on Read
Developers should determine the number of partitions based on the size of the input dataset. Luckily, Spark handles this well by default if your data is splittable. The result depends on how the data is laid out in storage such as S3 or HDFS and on how Spark splits the data while reading.
For example, when the input data is 50 GB (50,000 MB) and the default block size is 128 MB, it is stored in 391 blocks, which means 391 partitions (the arithmetic is sketched below). The input split is determined by the Hadoop InputFormat used to read the file and may differ depending on settings and environment.
Remember: 1 partition makes for 1 task that runs on 1 core.
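The arithmetic behind that example can be sketched as follows (the sizes are the ones from the paragraph above, not values reported by Spark):

```python
import math

input_size_mb = 50_000   # 50 GB expressed in MB
block_size_mb = 128      # default block / split size

# One partition per block on read: ceil(50000 / 128) = 391
partitions_on_read = math.ceil(input_size_mb / block_size_mb)
print(partitions_on_read)  # 391
```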
Partition on Shuffle
Shuffle partitions are the partitions used when data is shuffled for wide transformations (join, groupBy, aggregation). The default number of shuffle partitions is 200, and a DataFrame's partition count changes to this value after a wide transformation. This can hurt performance if 200 is not right for your use case: it was chosen as a reasonable general-purpose default, but in practice it often does not fit a developer's workload. The setting that changes this default is spark.sql.shuffle.partitions (see the sketch below). It is the most frequently tuned parameter when working with wide transformations.
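A minimal sketch of changing that default, assuming an existing SparkSession named spark and a DataFrame df with a customer_id column (both illustrative):

```python
# Lower the shuffle-partition default (200) for this session; 64 is illustrative.
spark.conf.set("spark.sql.shuffle.partitions", 64)

# Any wide transformation from here on (join, groupBy, ...) shuffles into 64 partitions.
counts = df.groupBy("customer_id").count()
print(counts.rdd.getNumPartitions())  # typically 64 (AQE may coalesce this further)
```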
For small amounts of data, developers should reduce the number of shuffle partitions (see Partition Size above). Having many partitions, each holding only a few MB of data, results in underutilization and increases the time spent shuffling over the network. If too much data is spread across too few partitions, the load on each executor increases and may lead to out-of-memory (OOM) errors. Disk spills also become likely, and spilling to disk is one of the slowest things that can happen in a Spark job. A rough sizing approach is sketched below.
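One common sizing heuristic (an assumption added here, not something this article prescribes) is to target on the order of 128 MB per shuffle partition, using the shuffle size observed in the Spark UI:

```python
import math

# Estimated shuffle size taken from the Spark UI for the stage (illustrative value).
shuffle_data_mb = 12_000          # ~12 GB shuffled
target_partition_mb = 128         # aim for ~128 MB per shuffle partition

num_shuffle_partitions = max(1, math.ceil(shuffle_data_mb / target_partition_mb))
spark.conf.set("spark.sql.shuffle.partitions", num_shuffle_partitions)
```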
Partition on Write
When writing a DataFrame to disk, developers need to pay attention to partition sizes. On write, Spark produces one file per task, and on read it reads at least one file per task. If the cluster that wrote the DataFrame used a g1 cluster (more memory) and could process large partition sizes, then a smaller cluster (with less memory) will have problems reading that DataFrame back.
Lastly, writing many small files incurs a storage cost that accumulates over time. repartition() can increase the number of partitions and therefore the number of output files, while coalesce() reduces the number of partitions and decreases the number of output files. These two methods are another way to manage partitions throughout all stages (a short sketch follows).
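A short sketch of the two methods on write, assuming the df from earlier and illustrative output paths:

```python
# repartition(n) triggers a full shuffle and can raise or lower the partition
# count, producing n output files.
df.repartition(200).write.mode("overwrite").parquet("s3://my-bucket/out/large/")

# coalesce(n) only merges existing partitions (no full shuffle), so it is the
# cheaper way to cut down the number of small output files.
df.coalesce(8).write.mode("overwrite").parquet("s3://my-bucket/out/compact/")
```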