What are Jobs, Stages, and Tasks in Apache Spark?

Job vs Stage vs Task

1 action = 1 job

1 job = a sequence of transformations ending in that action

1 job = 1 or more stages, split at wide (shuffle) transformations

1 stage = 1 task per partition of the data

Concept of Job in Spark

A job in Spark refers to a sequence of transformations on data that ends in an action. Whenever an action such as count(), first(), collect(), or saveAsTextFile() is called on an RDD (Resilient Distributed Dataset), a job is created. A job can be thought of as all the work needed to produce the result of one action, broken down into a series of steps.

Consider a scenario where you’re executing a Spark program and you call the action count() to get the number of elements. This creates a Spark job. If, later in the program, you call collect(), another job is created. So a Spark application can have multiple jobs, depending on the number of actions it calls, as in the sketch below.
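A minimal sketch of this behavior (the object name, local[*] master, and data are illustrative assumptions, not part of the original article): each of the two actions below triggers its own job, which you can confirm in the Jobs tab of the Spark UI.

```scala
import org.apache.spark.sql.SparkSession

object JobExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration; cluster settings are assumptions.
    val spark = SparkSession.builder()
      .appName("job-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations alone do not run anything yet (lazy evaluation).
    val rdd = sc.parallelize(1 to 1000)
      .map(_ * 2)
      .filter(_ % 3 == 0)

    // Each action triggers one job.
    val n = rdd.count()       // first job
    val all = rdd.collect()   // second job

    println(s"count = $n, first values = ${all.take(5).mkString(", ")}")
    spark.stop()
  }
}
```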

Concept of Stage in Spark

A stage in Spark represents a sequence of transformations that can be executed in a single pass, i.e., without any shuffling of data. Each job is divided into one or more stages. Each stage comprises tasks, and all the tasks within a stage perform the same computation, each on its own partition of the data.

The boundary between two stages is drawn when transformations cause data shuffling across partitions. Transformations in Spark are categorized into two types: narrow and wide. Narrow transformations, like map(), filter(), and union(), can be done within a single partition. But for wide transformations like groupByKey(), reduceByKey(), or join(), data from all partitions may need to be combined, thus necessitating shuffling and marking the start of a new stage.
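A short sketch of a stage boundary (object name, local master, and sample data are assumptions): the map() below is narrow and stays in the same stage, while reduceByKey() shuffles data and starts a new stage, so the single collect() job runs as two stages.

```scala
import org.apache.spark.sql.SparkSession

object StageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)

    val counts = words
      .map(w => (w, 1))     // narrow: stays within each partition (first stage)
      .reduceByKey(_ + _)   // wide: shuffles across partitions (stage boundary)

    // One action -> one job; the shuffle splits that job into two stages.
    counts.collect().foreach(println)

    // The lineage shows the shuffle boundary (ShuffledRDD) explicitly.
    println(counts.toDebugString)

    spark.stop()
  }
}
```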

Concept of Task in Spark

A task in Spark is the smallest unit of work that can be scheduled. Each stage is divided into tasks, one per partition of the RDD being processed. A task applies the stage’s transformations to a single partition and runs on a single executor.

For example, if a Spark job is divided into two stages and the underlying RDD has two partitions, each stage consists of two tasks. On a cluster with two executors, the two tasks of a stage can run in parallel, each performing the stage’s transformations on its own partition of the data.
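A minimal sketch of the task-per-partition relationship (object name, local[2] master, and data are assumptions): the RDD is created with two partitions, so every stage of a job over it runs two tasks, and the two local cores can execute them in parallel.

```scala
import org.apache.spark.sql.SparkSession

object TaskExample {
  def main(args: Array[String]): Unit = {
    // local[2] stands in for two executor cores, so two tasks can run at once.
    val spark = SparkSession.builder()
      .appName("task-example")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two partitions -> two tasks per stage for jobs over this RDD.
    val rdd = sc.parallelize(1 to 100, numSlices = 2).map(_ + 1)

    println(s"partitions = ${rdd.getNumPartitions}") // 2, i.e. 2 tasks per stage

    // This single-stage job runs 2 tasks, one per partition, in parallel.
    println(s"sum = ${rdd.sum()}")

    spark.stop()
  }
}
```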

In summary, a Spark job is split into multiple stages at the points where data shuffling is needed, and each stage is split into tasks that run the same code on different data partitions.
