Spark Performance Optimization: Spark Advanced Variables
Rahul Chanda
MTech Data Science | Big Data Engineer |Apache Spark|Hadoop (HDFS)|Python|Microsoft Azure|Databricks
Broadcast Variables:
Broadcast variables are read-only variables that are sent to all the executors. They are cached once on each node (in the Block Manager of each executor) in the cluster, instead of a copy being shipped with every task.
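They can only be defined on the Driver side. Executors can read the broadcast value but cannot modify it, and the underlying object should not be changed after it has been broadcast, so that every node sees the same value.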
A broadcast variable makes a small data set available locally on every node, so tasks read it from local memory instead of fetching it over the network, which improves the performance of your Spark job.
Because the value is cached on each node, Executors do not have to pull the same data multiple times.
A broadcast join is used to join two tables when one of them is small enough to fit in the memory of each executor. Spark sends the small table to every executor, so the large table does not need to be shuffled, which usually makes the join much faster than a Sort-merge Join.
df1.join(broadcast(df2))
Accumulator Variables:
Accumulators are shared variables that tasks on all executors can only add to, which makes them suitable for implementing counters or sums.
These variables are defined and assigned initial values on the Driver side. Executors can only update an accumulator, not read it. Only the Driver can read its value, and the updates are applied only after an action has triggered the computation.
Spark supports two numeric accumulators by default:
a. Long Accumulator
b. Double Accumulator
Developers can add custom Accumulators by subclassing AccumulatorV2.
Please let me know your thoughts and comment if any point is missed.