Broadcast Variables
#sparkday10of30
What is the purpose of broadcast variables in Spark?
Broadcast Variables
Purpose: Broadcast variables let you efficiently share a read-only value with every executor in your Spark cluster. Spark sends the value to each executor once and caches it there, instead of shipping a copy with every task. They are particularly useful when the same data, such as a sizeable lookup table that still fits in memory on each executor, is needed across many tasks or stages of your computation.
Use Case: A common use case for broadcasting is a join between a large dataset and a small one. Sending the smaller dataset to every executor lets each one join its partitions locally, so the larger dataset does not have to be shuffled.
Example: Suppose you have a large DataFrame sales_df and a small lookup table product_lookup_df that you want to join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

sales_df = spark.createDataFrame(
    [(101, 100), (102, 150), (103, 200), (104, 250)],
    ["product_id", "amount"],
)
product_lookup_df = spark.createDataFrame(
    [(101, "Product A"), (102, "Product B"), (103, "Product C"), (104, "Product D")],
    ["product_id", "product_name"],
)

# Mark the product lookup DataFrame for broadcasting (this is a join hint)
broadcasted_product_lookup_df = broadcast(product_lookup_df)

# Perform a broadcast join between sales_df and broadcasted_product_lookup_df
joined_df = sales_df.join(broadcasted_product_lookup_df, on="product_id", how="inner")

# Show the joined DataFrame
joined_df.show()
Advantages: