What is a DAG?

A DAG, or Directed Acyclic Graph, is a conceptual representation of the series of computations Spark performs. It captures both the sequence of transformations applied to the data and the execution plan that will be carried out across the cluster.

Key Components of a DAG

  1. Vertices: Represent RDDs or DataFrames resulting from transformations.
  2. Edges: Represent the operations (transformations or actions) applied to the data.

How a DAG Works in PySpark

  1. Transformations and Actions: When you define transformations (e.g., map, filter, groupBy) on RDDs or DataFrames, Spark does not execute them immediately. Instead, it builds a logical execution plan in the form of a DAG.
  2. Lazy Evaluation: Spark evaluates the DAG lazily: transformations are not executed until an action (e.g., collect, count, saveAsTextFile) is called, which lets Spark optimize the entire plan before running it (a minimal sketch follows this list).
  3. Job and Stages: When an action is called, Spark’s scheduler divides the DAG into a series of stages. Each stage consists of tasks that can be executed in parallel. Stages are separated by shuffle operations (data redistributions across nodes).
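To make lazy evaluation concrete, here is a minimal standalone sketch (the DataFrame, column names, and app name are illustrative, not taken from the example later in this article): the transformations only extend the DAG, explain() prints the plan Spark has built so far, and nothing is actually computed until the count() action at the end.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Lazy Evaluation Sketch").getOrCreate()

# Transformations: these only add nodes to the DAG; no data is processed yet
df = spark.range(1_000_000)                          # DataFrame with a single "id" column
doubled = df.withColumn("doubled", F.col("id") * 2)
filtered = doubled.filter(F.col("doubled") > 10)

# Inspect the plan Spark has built so far (printing the plan does not run the job)
filtered.explain()

# Only now does Spark execute the DAG, because count() is an action
print(filtered.count())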

Example of DAG in PySpark

Let's consider an example to illustrate how a DAG is built and executed in PySpark.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("DAG Example").getOrCreate()

# Create a DataFrame
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3), ("David", 4)]
df = spark.createDataFrame(data, ["name", "value"])

# Define transformations
df_filtered = df.filter(df["value"] > 1)
df_squared = df_filtered.withColumn("squared", df_filtered["value"] * df_filtered["value"])

# Define an action
result = df_squared.collect()
print(result)        

DAG Construction in the Example

  1. Initial DataFrame (df): The starting DataFrame created from the in-memory data; it forms the first vertex of the DAG.
  2. Filter Transformation (df_filtered): A transformation that filters rows where value > 1.
  3. WithColumn Transformation (df_squared): A transformation that adds a new column, squared, computed from value.

When you call collect(), Spark builds a DAG representing these transformations. The DAG for the above example might look like this:

Initial DataFrame
       |
    filter (value > 1)
       |
    withColumn (squared)
       |
    collect        

Execution Plan

Because filter and withColumn are narrow transformations (no data has to move between partitions), Spark pipelines them into a single stage rather than splitting them up:

  1. Stage 1: Read the input partitions, apply the filter and withColumn transformations, and return the resulting rows to the driver for collect.

A wide transformation such as groupBy or join would introduce a shuffle, and that shuffle boundary would split the job into additional stages (see the sketch below).

Each stage consists of tasks that are distributed and executed across the cluster nodes.
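As a rough sketch of where stage boundaries actually come from (the data and column names are illustrative): the narrow filter and withColumn below are pipelined into one stage, while the groupBy aggregation forces a shuffle and therefore starts a new stage. The physical plan printed by explain() shows an Exchange node at the shuffle, and the Spark UI shows the resulting stages.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Stage Boundary Sketch").getOrCreate()

data = [("Alice", 1), ("Bob", 2), ("Cathy", 3), ("David", 4)]
df = spark.createDataFrame(data, ["name", "value"])

# Narrow transformations: no shuffle, so they run together in the same stage
narrow = (df.filter(F.col("value") > 1)
            .withColumn("squared", F.col("value") * F.col("value")))

# Wide transformation: groupBy requires a shuffle, which marks a stage boundary
aggregated = narrow.groupBy("name").agg(F.sum("squared").alias("total"))

# The physical plan contains an Exchange node where the shuffle occurs
aggregated.explain()
aggregated.show()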

Benefits of DAG in PySpark

  1. Optimization: Spark can optimize the execution plan by understanding the entire sequence of transformations. It can, for instance, minimize data shuffling or combine transformations.
  2. Fault Tolerance: Since the DAG represents the logical execution plan, Spark can recompute lost data using lineage information if a node fails (a small sketch of inspecting lineage follows this list).
  3. Parallel Execution: The DAG allows Spark to execute tasks in parallel, improving performance and resource utilization.
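Point 2 can be made tangible with the RDD API: toDebugString() prints the lineage Spark would use to recompute a lost partition. A minimal sketch (the exact output format varies by Spark version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Lineage Sketch").getOrCreate()
sc = spark.sparkContext

# Build a short RDD lineage: parallelize -> map -> filter
rdd = sc.parallelize(range(10))
mapped = rdd.map(lambda x: x * x)
filtered = mapped.filter(lambda x: x > 10)

# toDebugString() returns the lineage graph Spark relies on for recovery
print(filtered.toDebugString().decode("utf-8"))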
