"Spark Performance Tuning with help of Spark UI"

Abhishek Singh

Technical Lead Data Engineer Azure at Publicis Sapient. Expertise in SQL, Pyspark and Scala with Spark, Kafka with Spark Streaming, Databricks, and Data Tuning Spark Application for PetaByte. Cloud AWS, Azure and GCP

Published: September 22, 2022

Spark is a distributed data processing engine that relies heavily on the memory available for computation. If you have worked with Spark, you have almost certainly faced job, task, or stage failures caused by memory issues, so memory management is one of the key techniques for an efficient Spark environment. In this post, we will see how the Spark UI can help you understand your jobs better, turning performance tuning into a repeatable process rather than ad hoc trial and error.

I know many new Spark developers who completely ignore the Spark UI and think they can understand everything from what the job prints to the console. My first recommendation is to start using the Spark UI and keep it open all the time, even when you are not facing any memory issues. As a Spark developer, you should be comfortable with the Spark UI, because it gives you a great deal of information about your job that you may not know or may never have bothered to look for.

Let's quickly look at some of the tabs in the Spark UI and the information each one gives developers.

1. Jobs

This is the first tab, and it gives a summary of the jobs executed in the environment. The information here includes the number of jobs completed, the duration of each job, and how many stages and tasks each job consisted of. If you need information on executors, click on "Event Timeline" to see a graphical view of executors being added and removed. If you plan to use dynamic allocation for executors, this view tells you how many executors the system allocated for your job.

In the above screenshot, we can see 2 executors (Executors 3 & 4) created for this job. The job took 46 seconds to complete, with 6 stages and 623 tasks.

2. Stages

This page gives detailed information about each stage that was created for the job. Let's look at the screenshot below:

You can get the following information from this page:

1) Number of completed stages. In this case, it is 6.

2) Duration of each stage. You can also see that 3 of the stages ran in parallel, as they start at the same time.

3) Table scan volume. In this job, Spark reads data from an S3 location. For S3, the default block size is 64 MB or 128 MB, depending on the Spark environment; in this case, it is 64 MB. If you divide the file size by 64 MB, you can see that the number of logical partitions matches the number of tasks created in the table scan stage. Example: 547.8 MB / 64 MB ≈ 8.6, which rounds up to 9 blocks. Hence 9 tasks were created in this stage.

4) Shuffle read/write. You can also check how much data was read or written during the redistribution of data, in other words the shuffle phase. This is very important information, as it lets you identify a suitable number of shuffle partitions for the job. The default value of spark.sql.shuffle.partitions is 200, which is clearly far too many for the job in this example. You can change the number of shuffle partitions at any point in the job.
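The arithmetic in items 3 and 4 can be sketched in plain Python. This is only an illustration: the function names are hypothetical, and the 128 MB target partition size is an assumed rule of thumb, not a value taken from this job or a Spark default. The real shuffle volume should be read from the Stages tab.

```python
import math


def estimate_scan_tasks(file_size_mb: float, block_size_mb: float = 64.0) -> int:
    """Estimate table-scan tasks as the number of blocks the file spans."""
    return math.ceil(file_size_mb / block_size_mb)


def suggest_shuffle_partitions(shuffle_mb: float, target_partition_mb: float = 128.0) -> int:
    """Suggest a shuffle partition count for a given shuffle volume.

    target_partition_mb is an assumed rule of thumb, not a Spark default.
    """
    return max(1, math.ceil(shuffle_mb / target_partition_mb))


if __name__ == "__main__":
    # The example from the Stages tab: a 547.8 MB scan with 64 MB blocks.
    print(estimate_scan_tasks(547.8))
    # A hypothetical 2,000 MB shuffle with the assumed 128 MB target.
    print(suggest_shuffle_partitions(2000.0))
```

In a PySpark session, a value chosen this way could then be applied with `spark.conf.set("spark.sql.shuffle.partitions", "16")` before the shuffle-heavy step runs.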

3. Environment

Check here for all the default and custom values of the configuration parameters your job used for execution. This tab gives a lot of information, and you should be well aware of a few key parameters related to executors, drivers, memory management, shuffle partitions, etc.
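The same key parameters can also be picked out programmatically. The sketch below filters a list of (key, value) pairs down to the families mentioned above; the helper name and the prefix list are assumptions made for illustration. In PySpark, `spark.sparkContext.getConf().getAll()` returns pairs in this shape, although it may list only explicitly set values rather than every default.

```python
# Prefixes for the parameter families worth checking first (an assumed selection).
KEY_PREFIXES = (
    "spark.executor.",
    "spark.driver.",
    "spark.memory.",
    "spark.sql.shuffle.",
    "spark.dynamicAllocation.",
)


def pick_key_settings(conf_pairs, prefixes=KEY_PREFIXES):
    """Return the (key, value) pairs whose key starts with a prefix, sorted by key."""
    return sorted((k, v) for k, v in conf_pairs if k.startswith(prefixes))


if __name__ == "__main__":
    # Hypothetical sample values, standing in for a real session's configuration.
    sample = [
        ("spark.app.name", "demo"),
        ("spark.executor.memory", "4g"),
        ("spark.sql.shuffle.partitions", "200"),
    ]
    for key, value in pick_key_settings(sample):
        print(f"{key} = {value}")
```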

4. SQL

It gives you the detailed DAG (Directed Acyclic Graph) for the query. This corresponds directly to the execution plan the optimizer prepared to complete the job. At every step you can see how many rows are affected and what operation or transformation was applied to the RDD. Let's look at the screenshot below:

You can classify any SQL tuning activity into broadly 3 categories: efficient file/table scans, optimizing joins, and well-planned other operations such as aggregations and filters.

Efficient file/table scan

1. Try using compression to reduce the size.

2. Convert to an optimized format such as Parquet.

3. Split a single file into multiple parts, but avoid small file parts.

4. Increase the block size to create fewer tasks.
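As a rough illustration of the first two points, the sketch below rewrites a CSV file as compressed Parquet. The function name, the paths, and the choice of snappy compression are assumptions; `spark` is expected to be an existing SparkSession, so the function imports nothing itself.

```python
def csv_to_parquet(spark, csv_path: str, parquet_path: str, compression: str = "snappy") -> None:
    """Read a CSV file with a header row and rewrite it as compressed Parquet.

    `spark` is an existing SparkSession; both paths are placeholders.
    Without schema inference, every column is read as a string.
    """
    df = spark.read.option("header", "true").csv(csv_path)
    (
        df.write
        .mode("overwrite")                   # replace any previous output
        .option("compression", compression)  # Parquet compression codec
        .parquet(parquet_path)
    )
```

A call might look like `csv_to_parquet(spark, "s3://bucket/input.csv", "s3://bucket/output/")`, where both locations are hypothetical.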

Optimizing Joins

1. Broadcast join is the recommended strategy whenever possible. Reducing the scan size directly helps more tables qualify for broadcasting. The default value of spark.sql.autoBroadcastJoinThreshold is 10 MB, which can be increased to include bigger tables. However, avoid broadcasting big tables, as it may result in errors. Sometimes a sort-merge join is the better and more reliable choice.

2. Most of the tuning techniques that apply to other RDBMSs also hold in Spark, such as partition pruning, using buckets, and avoiding operations on join columns.

3. Shuffle strategy. We saw earlier in the post that the default value of 200 partitions was far too many for the shuffle data volume. Keeping an optimized value for shuffle partitions can make the most significant improvement in query performance. A single partition should not exceed 2 GB, as that can result in a fatal error.
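The broadcast threshold check from item 1 can be sketched as a simple size comparison. The helper names are hypothetical, and a real decision depends on Spark's own size estimate for the table, which may differ from the size on disk.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def mb_to_bytes(size_mb: float) -> int:
    """Convert megabytes to bytes, the unit the threshold setting uses."""
    return int(size_mb * 1024 * 1024)


def fits_broadcast(table_size_bytes: int,
                   threshold_bytes: int = DEFAULT_BROADCAST_THRESHOLD_BYTES) -> bool:
    """Return True if a table of this estimated size is within the threshold.

    A threshold of -1 disables automatic broadcasting, so nothing fits.
    """
    return 0 <= table_size_bytes <= threshold_bytes


if __name__ == "__main__":
    print(fits_broadcast(mb_to_bytes(8)))                    # within the 10 MB default
    print(fits_broadcast(mb_to_bytes(50)))                   # too big for the default
    print(fits_broadcast(mb_to_bytes(50), mb_to_bytes(64)))  # fits a raised threshold
```

Raising the threshold in a PySpark session could then look like `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(mb_to_bytes(64)))`, where 64 MB is only an example value.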

Number of Cores & Executors

Setting the right values for cores and executors is a never-ending discussion. You can find a lot of posts on the web that walk through the math for arriving at the right number of executors and amount of memory. In practice, however, this may not help much. Sometimes you may need a "fat" executor, while most of the time a "thin" executor works perfectly fine. In short, one configuration does not fit all situations, so you may have to do some trial and error here. I prefer to keep dynamic allocation enabled, set cores per executor to 4, and then adjust executor memory for the job.

Let's look at the screenshot below, which shows the change in the DAG after making 2 changes: converting the file type from CSV to Parquet and increasing the broadcast join threshold value.

I hope this post helps you in your Spark tuning activity. Feel free to share with us any other tuning exercises you generally do in Spark.
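One way to express that starting point is as a small set of configuration values. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the 8g executor memory is only a placeholder to adjust while watching the Spark UI.

```python
def executor_conf(executor_memory: str = "8g",
                  cores_per_executor: int = 4,
                  dynamic_allocation: bool = True) -> dict:
    """Build a starting configuration: dynamic allocation on, 4 cores per executor.

    executor_memory is the value to experiment with; "8g" is an assumed placeholder.
    """
    return {
        "spark.dynamicAllocation.enabled": str(dynamic_allocation).lower(),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": executor_memory,
    }


def to_submit_args(conf: dict) -> list:
    """Turn a configuration dict into spark-submit style --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(executor_conf(executor_memory="6g"))))
```

Depending on the cluster manager and Spark version, dynamic allocation may need additional settings (for example an external shuffle service) that are not shown here.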