SPARK - Partitioning
nagaraju juluru
Hadoop | Hive | Sqoop | PySpark| Spark Streaming | Kafka |AWS(S3,EMR,Ec2,Athena,Glue,Dynamo DB and Redshift) | Databricks | HBASE | Cassandra |Snowflake| Airflow|
Why It Matters
Performance suffers when a developer does not consider the underlying infrastructure while building a Spark application. The technical cost includes longer runtimes, harder debugging, and poor utilization of cluster resources. Ignoring Spark best practices also drives up cost downstream. Cost reduction is vital to the business: better cluster utilization reduces idle resources and lowers the number of DPUs consumed.
Partition Size
Questions to answer before deciding the optimal partition count:
Solutions
* For less complex Spark apps (think jobs with only narrow transformations), the default number of partitions is usually sufficient. Spark creates one partition for each block of the file, which is 128 MB per partition (if the file is splittable). The following code checks the number of partitions: df.rdd.getNumPartitions() (see the sketch below). This number can then be used to estimate the correct number of DPUs in AWS Glue.
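As a minimal sketch, assuming a SparkSession you create yourself and an illustrative Parquet path on S3 (both hypothetical), checking the partition count on read might look like this:

```python
from pyspark.sql import SparkSession

# Assumes you create (or already have) a SparkSession; the S3 path is illustrative.
spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Read a splittable format; Spark creates roughly one partition per 128 MB block.
df = spark.read.parquet("s3://my-bucket/events/")

# Number of partitions Spark created on read.
print(df.rdd.getNumPartitions())
```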
Partition on Read
Developers should determine the number of partitions based on the size of the input dataset. Luckily, Spark handles this well by default if your data is splittable. The result depends on how the data is laid out in storage such as S3 or HDFS and on how Spark splits the data while reading.
For example, when the input data is 50 GB (50,000 MB) and the default block size is 128 MB, it is stored in 391 blocks, which means 391 partitions (the arithmetic is sketched below). The input split is determined by the Hadoop InputFormat used to read the file and may differ depending on settings and environment.
Remember: 1 partition makes for 1 task that runs on 1 core.
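The arithmetic behind that example can be sketched as follows (the sizes are the ones from the paragraph above, not values reported by Spark):

```python
import math

input_size_mb = 50_000   # 50 GB expressed in MB
block_size_mb = 128      # default block / split size

# One partition per block on read: ceil(50000 / 128) = 391
partitions_on_read = math.ceil(input_size_mb / block_size_mb)
print(partitions_on_read)  # 391
```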
Partition on Shuffle
Shuffle partitions are the partitions used when data is shuffled for wide transformations (join, groupBy, aggregation). The default number of shuffle partitions is 200, and a DataFrame's partition count changes to this value after a wide transformation. This can hurt performance if 200 is not right for your use case: it was chosen as a reasonable general-purpose default, but in practice it often does not fit a developer's workload. The setting that changes this default is spark.sql.shuffle.partitions (see the sketch below). It is the most frequently tuned parameter when working with wide transformations.
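A minimal sketch of changing that default, assuming an existing SparkSession named spark and a DataFrame df with a customer_id column (both illustrative):

```python
# Lower the shuffle-partition default (200) for this session; 64 is illustrative.
spark.conf.set("spark.sql.shuffle.partitions", 64)

# Any wide transformation from here on (join, groupBy, ...) shuffles into 64 partitions.
counts = df.groupBy("customer_id").count()
print(counts.rdd.getNumPartitions())  # typically 64 (AQE may coalesce this further)
```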
For small amounts of data, developers should reduce the number of shuffle partitions (see Partition Size above). Having many partitions, each holding only a few MB of data, results in underutilization and increases the time spent shuffling over the network. If too much data is spread across too few partitions, the load on each executor increases and may lead to out-of-memory (OOM) errors. Disk spills also become likely, and spilling to disk is one of the slowest things that can happen in a Spark job. A rough sizing approach is sketched below.
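One common sizing heuristic (an assumption added here, not something this article prescribes) is to target on the order of 128 MB per shuffle partition, using the shuffle size observed in the Spark UI:

```python
import math

# Estimated shuffle size taken from the Spark UI for the stage (illustrative value).
shuffle_data_mb = 12_000          # ~12 GB shuffled
target_partition_mb = 128         # aim for ~128 MB per shuffle partition

num_shuffle_partitions = max(1, math.ceil(shuffle_data_mb / target_partition_mb))
spark.conf.set("spark.sql.shuffle.partitions", num_shuffle_partitions)
```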
Partition on Write
When writing a DataFrame to disk, developers need to pay attention to partition sizes. On write, Spark produces one file per task, and on read it reads at least one file per task. If the cluster that wrote the DataFrame used a g1 cluster (more memory) and could process large partition sizes, then a smaller cluster (with less memory) will have problems reading that DataFrame back.
Lastly, writing many small files incurs a storage cost that accumulates over time. repartition() can increase the number of partitions and therefore the number of output files, while coalesce() reduces the number of partitions and decreases the number of output files. These two methods are another way to manage partitions throughout all stages (a short sketch follows).
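A short sketch of the two methods on write, assuming the df from earlier and illustrative output paths:

```python
# repartition(n) triggers a full shuffle and can raise or lower the partition
# count, producing n output files.
df.repartition(200).write.mode("overwrite").parquet("s3://my-bucket/out/large/")

# coalesce(n) only merges existing partitions (no full shuffle), so it is the
# cheaper way to cut down the number of small output files.
df.coalesce(8).write.mode("overwrite").parquet("s3://my-bucket/out/compact/")
```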