Spark Performance Optimization: Spark Advanced Variables

Rahul Chanda

MTech Data Science | Big Data Engineer |Apache Spark|Hadoop (HDFS)|Python|Microsoft Azure|Databricks

Published: June 26, 2021

Broadcast Variables:

Broadcast variables are read-only variables that are broadcast to all executors. They are cached once on each node (in the Block Manager of each executor) instead of a copy being shipped with every task.

They can only be created on the driver. Executors can read the value but cannot modify it.

A broadcast variable makes a small data set available on every node, so tasks read it locally instead of fetching it over the network. This can noticeably improve the performance of your Spark job.

Because the value is cached on each node, executors do not have to pull the same data multiple times.
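To make the saving concrete, here is a rough back-of-the-envelope sketch in plain Python. The function name and the numbers (a 10 MiB lookup table, 2000 tasks, 20 executors) are assumptions chosen for illustration, and the model ignores details such as compression and how Spark actually distributes broadcast blocks.

```python
def bytes_shipped(lookup_size_bytes: int, num_tasks: int, num_executors: int) -> dict:
    """Rough model: data received by the cluster for one lookup table.

    per_task  -> the table is serialized with every task that uses it
    broadcast -> the table is received once per executor and then cached
    """
    return {
        "per_task": lookup_size_bytes * num_tasks,
        "broadcast": lookup_size_bytes * num_executors,
    }


# Hypothetical job: 10 MiB lookup table, 2000 tasks, 20 executors.
result = bytes_shipped(10 * 1024 * 1024, num_tasks=2000, num_executors=20)
ratio = result["per_task"] // result["broadcast"]

print(result, ratio)
```

Under these assumptions the per-task approach moves as many copies as there are tasks, while the broadcast approach moves one copy per executor, so the gap grows with the number of tasks per executor.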

A broadcast join is used to join two tables when one of them is small enough to fit in the memory of each executor. Spark sends the small table to every executor, so the large table does not have to be shuffled and sorted. This usually makes the join much faster than a sort-merge join.

from pyspark.sql.functions import broadcast
df1.join(broadcast(df2))
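The idea behind a broadcast join can be sketched in plain Python, without Spark. This is a simplified model and not Spark's implementation: the function `broadcast_hash_join`, the `orders` and `countries` data, and the `country_code` key are all hypothetical. The small table is turned into a hash table once, and each partition of the large table is then joined locally against it.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large table against a small table.

    The small table is hashed once (this stands in for the copy that each
    executor would hold), so the large table is only scanned, never shuffled.
    """
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:      # each partition stays where it is
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: a large fact table split into two partitions,
# and a small dimension table.
orders = [
    [{"country_code": "IN", "amount": 120}, {"country_code": "US", "amount": 75}],
    [{"country_code": "IN", "amount": 40}, {"country_code": "FR", "amount": 10}],
]
countries = [
    {"country_code": "IN", "name": "India"},
    {"country_code": "US", "name": "United States"},
]

joined = broadcast_hash_join(orders, countries, "country_code")
for row in joined:
    print(row)
```

The order with code `FR` has no match in the small table, so an inner join drops it. A sort-merge join would instead have to repartition and sort both tables by the key before matching them.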

Accumulator Variables:

Accumulators are shared variables that every executor can add to, which makes them useful for implementing counters or aggregations such as sums.

These variables are defined and given their initial values on the driver. Executors can only update an accumulator, and only the driver can read its value. The updates happen when an action is triggered, so the driver sees the final value after the action completes.

Spark supports two numeric accumulators by default:
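The flow described above can be modeled in plain Python. This is only a sketch of the semantics, not Spark code: `run_task`, `partitions`, and `bad_records` are hypothetical names, and the loop stands in for an action that runs one task per partition.

```python
def run_task(partition):
    """Executor side: parse one partition and count bad records locally."""
    parsed, local_bad = [], 0
    for raw in partition:
        try:
            parsed.append(int(raw))
        except ValueError:
            local_bad += 1          # a task only ever adds to its own count
    return parsed, local_bad


# Hypothetical input split into two partitions.
partitions = [["1", "2", "x"], ["4", "", "6"]]

bad_records = 0                     # driver side: define with an initial value
results = []
for partition in partitions:        # stands in for triggering an action
    parsed, local_bad = run_task(partition)
    results.extend(parsed)
    bad_records += local_bad        # driver merges the update from each task

print(results, bad_records)         # driver side: read the final value
```

In this model nothing is counted until the loop runs, which mirrors the point that an accumulator holds a meaningful value only after an action has executed.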

a. Long Accumulator

b. Double Accumulator

Developers can add custom accumulators by extending AccumulatorV2.
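AccumulatorV2 is part of Spark's Scala/Java API, and a subclass has to implement isZero, copy, reset, add, merge, and value. The sketch below mirrors that contract in plain Python so it can be run without a cluster. The class `SetAccumulator`, the sample records, and the error codes are hypothetical, and the loop only imitates how task-side copies are merged back on the driver. In PySpark itself, the extension point is `AccumulatorParam` rather than AccumulatorV2.

```python
class SetAccumulator:
    """Plain-Python model of a custom accumulator that collects distinct items."""

    def __init__(self):
        self._items = set()

    def is_zero(self):
        # True when the accumulator still holds its initial (empty) value.
        return not self._items

    def copy(self):
        new = SetAccumulator()
        new._items = set(self._items)
        return new

    def reset(self):
        self._items.clear()

    def add(self, item):
        # Called on the executor side for each update.
        self._items.add(item)

    def merge(self, other):
        # Called on the driver side to fold in a task's partial result.
        self._items |= other._items

    @property
    def value(self):
        return frozenset(self._items)


driver_acc = SetAccumulator()
partitions = [["ok", "E42", "ok"], ["E7", "E42"]]   # hypothetical records

task_accs = []
for partition in partitions:
    task_acc = driver_acc.copy()    # each task works on its own zeroed copy
    task_acc.reset()
    for record in partition:
        if record != "ok":
            task_acc.add(record)
    task_accs.append(task_acc)

for task_acc in task_accs:          # the driver merges the partial results
    driver_acc.merge(task_acc)

error_codes = driver_acc.value
print(sorted(error_codes))
```

Because `merge` is a set union, the result does not depend on the order in which tasks finish. That is the property a custom accumulator needs in order to be safe.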

Please let me know your thoughts and comment if any point is missed.
