Introduction to PySpark
Hemavathi .P
What is Spark?
Apache Spark is an open-source, distributed computing system that provides a fast and general-purpose cluster-computing framework for big data processing. Spark is designed for both batch and real-time data processing and offers an easier, more powerful alternative to MapReduce.
Spark provides in-memory data processing, which makes it faster than Hadoop MapReduce, especially for iterative algorithms such as those used in machine learning. It supports a wide variety of processing workloads, including:
- Batch processing
- Interactive queries (Spark SQL)
- Real-time stream processing (Spark Streaming)
- Machine learning (MLlib)
- Graph processing (GraphX)
Spark operates on the Resilient Distributed Dataset (RDD) abstraction and also supports higher-level APIs like DataFrames and Datasets for structured data processing.
Key Components of Spark:
- Spark Core: the base engine for distributed task scheduling, memory management, and the RDD API
- Spark SQL: structured data processing with DataFrames and SQL queries
- Spark Streaming: scalable, fault-tolerant processing of live data streams
- MLlib: a library of distributed machine learning algorithms
- GraphX: an API for graph processing and graph-parallel computation
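As a quick taste of the higher-level APIs, here is a minimal DataFrame sketch using SparkSession, the Spark SQL entry point; the app name, column names, and rows are illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("intro").getOrCreate()  # entry point for Spark SQL
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.filter(df.id > 1).show()  # query runs through the Catalyst optimizer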
Spark vs. MapReduce
Aspect | MapReduce | Spark
Programming Model | Two phases: Map & Reduce | Supports complex DAGs (Directed Acyclic Graphs)
Data Processing | Disk-based (persistent storage) | In-memory processing for faster performance
Fault Tolerance | Relies on replicating data on HDFS | RDDs track lineage for fault tolerance
Ease of Use | Low-level, harder to program | Higher-level APIs (RDD, DataFrame, Dataset)
Performance | Slower due to disk-based operations | Faster due to in-memory processing
Streaming | Not designed for real-time processing | Spark Streaming for real-time data processing
Summary: Spark is faster, more flexible, and easier to program than MapReduce due to its in-memory processing and higher-level APIs.
What is RDD (Resilient Distributed Dataset)?
An RDD is a fundamental data structure in Spark. It is an immutable distributed collection of objects that can be processed in parallel across a cluster. RDDs are fault-tolerant, meaning if a node fails, Spark can recompute the lost data using the lineage graph.
Key Properties of RDDs:
- Immutable: once created, an RDD cannot be changed; transformations produce new RDDs
- Distributed: data is partitioned across the nodes of the cluster
- Fault-tolerant: lost partitions can be recomputed from the lineage graph
- Lazily evaluated: transformations are recorded but only executed when an action runs
Creating RDDs:
RDDs can be created by:
- Parallelizing an existing Python collection with sc.parallelize()
- Loading an external dataset, e.g., with sc.textFile()
- Transforming an existing RDD (e.g., with map() or filter())
from pyspark import SparkContext
sc = SparkContext("local", "intro")  # entry point for the RDD API
# Example: create an RDD from a Python list
rdd = sc.parallelize([1, 2, 3, 4, 5])
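RDDs can also be built from external storage. A minimal sketch, reusing the SparkContext sc created above; the file path is illustrative:
# Example: create an RDD from a text file (hypothetical path)
lines = sc.textFile("data.txt")
print(lines.count())  # action: number of lines in the file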
Lineage Graph
The lineage graph records the sequence of transformations applied to an RDD. Because every RDD remembers how it was derived from its parent RDDs, Spark can recover lost partitions after a node failure by recomputing them from the original source. This lineage-based recovery is what makes RDDs fault-tolerant without replicating intermediate data.
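You can inspect an RDD's lineage with toDebugString(). A small sketch, assuming the SparkContext sc from the earlier example:
rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 5)
# Prints the chain of transformations Spark recorded for this RDD
print(big.toDebugString().decode())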
Directed Acyclic Graph (DAG)
A DAG is a representation of a Spark job’s execution plan. It is a directed, acyclic graph that represents the sequence of computations (stages) that must occur for the job to complete. Each node in the graph corresponds to a stage or task, and edges represent data dependencies between stages.
How DAG Works:
1. When an action is called, Spark builds a DAG of stages from the RDD lineage.
2. The DAG scheduler splits the graph into stages at shuffle boundaries; narrow transformations are pipelined together within a stage.
3. Each stage is broken into tasks, one per partition, which the task scheduler distributes to executors.
4. Results flow back to the driver or are written to storage once all stages complete.
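A sketch of a job with one shuffle boundary, again assuming the SparkContext sc from earlier; Spark runs this as two stages, which you can see in the Spark UI:
# Stage 1: parallelize + map are narrow and pipelined together
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).map(lambda kv: (kv[0], kv[1] * 10))
# reduceByKey needs a shuffle, so it starts a new stage
totals = pairs.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # the action triggers DAG construction and execution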
Transformations and Actions
Transformations:
Transformations are lazy operations that define a computation on an RDD but do not immediately trigger execution. Instead, they create a new RDD representing the transformed data. The actual computation is triggered when an action is invoked.
Types of Transformations:
- Narrow: each output partition depends on exactly one input partition (e.g., map(), filter()); no shuffle is needed.
- Wide: an output partition depends on multiple input partitions (e.g., reduceByKey(), join()); a shuffle is required.
Examples of Transformations: map(), filter(), flatMap(), distinct(), union(), groupByKey(), reduceByKey(), join(), as shown in the sketch below.
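A minimal sketch of chained transformations, assuming the SparkContext sc from earlier; note that none of these lines triggers any computation:
words = sc.parallelize(["spark is fast", "spark is easy"])
tokens = words.flatMap(lambda line: line.split())  # one element per word
upper = tokens.map(lambda w: w.upper())            # transform each element
no_is = upper.filter(lambda w: w != "IS")          # keep only some elements
# Still nothing has executed: transformations are lazy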
Actions:
Actions trigger the execution of the RDD transformations and return a result to the driver or write data to an external storage system. Until an action is invoked, the transformations remain lazily defined; the action causes Spark to plan and execute the whole chain.
Examples of Actions: collect(), count(), first(), take(n), reduce(), foreach(), saveAsTextFile(), as shown in the sketch below.
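A short sketch of common actions, assuming the SparkContext sc from earlier:
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.count())                     # 5
print(nums.take(2))                     # [1, 2]
print(nums.reduce(lambda a, b: a + b))  # 15
nums.saveAsTextFile("output_dir")       # writes one file per partition (illustrative path)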
Narrow vs. Wide Transformations
Narrow transformations compute each output partition from a single input partition, so no data needs to move between nodes:
# Narrow: each element is processed within its own partition
rdd = rdd.map(lambda x: x * 2)
Wide transformations require records with the same key to be brought together, which forces a shuffle across the cluster:
# Wide: operates on a key-value (pair) RDD and shuffles values by key
rdd = rdd.reduceByKey(lambda a, b: a + b)
Key Takeaways:
- Narrow transformations are cheap because they avoid moving data between partitions.
- Wide transformations involve a shuffle, one of the most expensive operations in Spark.
- Prefer reduceByKey() over groupByKey() where possible, since it combines values locally before shuffling.
- Minimizing shuffles is one of the most effective ways to speed up a Spark job.
List of Common Transformations and Actions in PySpark
Transformations: map(), filter(), flatMap(), distinct(), union(), intersection(), groupByKey(), reduceByKey(), sortByKey(), join(), repartition(), coalesce()
Actions: collect(), count(), first(), take(n), reduce(), foreach(), countByKey(), saveAsTextFile()
Best Practices for Performance and Scalability:
- Cache or persist RDDs that are reused across multiple actions (cache(), persist()).
- Prefer reduceByKey() over groupByKey() to reduce shuffle volume.
- Avoid collect() on large datasets; it pulls all data to the driver and can cause out-of-memory errors.
- Use DataFrames for structured data so Spark's Catalyst optimizer can plan the job.
- Broadcast small lookup tables instead of shipping them with every task.
- Tune the number of partitions (repartition()/coalesce()) to match your cluster's parallelism.
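A sketch of the caching and broadcasting points, assuming the SparkContext sc from earlier; the input path is illustrative:
# Cache an RDD that multiple actions will reuse
logs = sc.textFile("events.txt")  # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line).cache()
print(errors.count())  # first action computes and caches the result
print(errors.take(5))  # second action is served from the cache

# Broadcast a small lookup table instead of capturing it in every task closure
lookup = sc.broadcast({"a": 1, "b": 2})
codes = sc.parallelize(["a", "b", "a"]).map(lambda k: (k, lookup.value[k]))
print(codes.collect())  # [('a', 1), ('b', 2), ('a', 1)]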