Apache Spark Transformations and Actions
Sai Prasad Padhy
Senior Big Data Engineer | Azure Data Engineer | Hadoop | PySpark | ADF | SQL
In this guide, we'll explore Transformations and Actions in detail, breaking down the complexities and providing simple examples to make these concepts easier to understand.
Understanding Spark Transformations
What are Transformations?
Transformations in Apache Spark are operations that create a new dataset (an RDD or DataFrame) from an existing one. Think of transformations as the recipe steps in cooking: you have your raw ingredients (data), and with each transformation you mix, filter, or reshape them to create a new dish (a new DataFrame).
There are two types of transformations:
Narrow Transformation
Transformations that do not result in data movement between partitions are called Narrow transformations.
Some examples: map(), filter(), flatMap(), and union() on RDDs, or select(), filter(), and withColumn() on DataFrames.
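To make this concrete, here is a minimal PySpark sketch (the sample data, column names, and app name are made up for illustration). Every step below is a narrow transformation, so each output partition depends on a single input partition and no data is shuffled:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("narrow-vs-wide-demo").getOrCreate()

# Illustrative data: (name, amount)
df = spark.createDataFrame([("a", 10), ("b", 25), ("a", 40)], ["name", "amount"])

# Narrow transformations: each partition is processed on its own, no shuffle
filtered = df.filter(F.col("amount") > 15)                         # per-row filtering
enriched = filtered.withColumn("amount_x2", F.col("amount") * 2)   # per-row computation

# Nothing has executed yet - Spark has only recorded these steps in its plan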
Wide Transformation
Transformations that involve data movement between partitions are called Wide transformations or shuffle transformations.
Some examples: groupByKey(), reduceByKey(), join(), and distinct() on RDDs, or groupBy() and join() on DataFrames.
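Continuing the same illustrative sketch, a groupBy aggregation is a wide transformation: all rows with the same key must be brought to the same partition, which requires a shuffle across the cluster.

# Wide transformation: groupBy forces a shuffle so that every row with the
# same "name" lands in the same partition before the sum is computed
aggregated = enriched.groupBy("name").agg(F.sum("amount_x2").alias("total"))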
Complete list of transformations - https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Understanding Spark Actions
What are Actions?
Actions, on the other hand, are operations that trigger the execution of transformations, producing a result or side effect. Going back to our cooking analogy, actions are like turning on the oven to bake the final dish after all the preparation.
Some examples: collect(), count(), take(), reduce(), and saveAsTextFile() on RDDs, or show(), count(), and write on DataFrames.
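Continuing the sketch, it is only when one of these actions is called that Spark executes the transformations recorded so far and produces a result:

# Actions: each of these triggers execution of the plan built above
aggregated.show()                 # prints the aggregated rows to the console
row_count = aggregated.count()    # returns the number of rows to the driver
rows = aggregated.collect()       # brings all result rows to the driver as a list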
Complete list of actions - https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
When Spark Consumes Resources
Lazy Evaluation
Spark operates on a principle called lazy evaluation. It doesn't execute transformations immediately but rather keeps track of them in a plan. It only springs into action when an action is called. Imagine creating a shopping list for your recipes – you plan everything first before hitting the store.
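Here is a small sketch of lazy evaluation (reusing the spark session from the earlier example; the file path is a placeholder): the transformations only record lineage, and nothing is computed until the final action.

logs = spark.read.text("logs.txt")                     # placeholder path, assumed to exist
errors = logs.filter(logs.value.contains("ERROR"))     # transformation: only recorded, not run
first_errors = errors.limit(10)                        # transformation: still only recorded

first_errors.show()   # action: only now does Spark scan the file and run the plan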
How Spark Remembers Transformations During Actions
Spark maintains a logical execution plan, known as the Directed Acyclic Graph (DAG). When an action is invoked, Spark refers to this plan to understand the sequence of transformations required to produce the final result. It's similar to following a cooking recipe step by step to create a delicious dish.
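You can look at this plan yourself: calling explain() on a DataFrame (shown here on the illustrative aggregated DataFrame from earlier) prints the logical and physical plans Spark will follow once an action is invoked.

# Inspect the plan Spark has built for this DataFrame, without executing it
aggregated.explain(True)   # prints parsed/analyzed/optimized logical plans and the physical plan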
Conclusion:
Apache Spark's Transformations and Actions are like following a cooking recipe to prepare a delightful feast. Transformations are the recipe steps, and actions are the moments you put the plan into motion. Spark's lazy evaluation and DAG ensure efficient resource usage, making big data processing straightforward.