PySpark — Aamir P

As part of my learning journey and as a requirement for my new project, I have started exploring PySpark. In this article, I will walk you through PySpark briefly from head to tail and, I hope, give you a clear picture of it.

Let us start our learning journey.

PySpark is the Python API for Apache Spark, an open-source, distributed computing system used for big data processing. Apache Spark can handle large-scale data and distribute it across clusters, which allows for fast computation and efficient processing of big datasets.

1. Installation and SparkSession

pip install pyspark

To use PySpark, you first need to create a SparkSession:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Check the Spark version
print(spark.version)

2. DataFrames and RDDs

  • DataFrame: Higher-level abstraction for structured data (rows and columns) that resembles a Pandas DataFrame.

# Creating a DataFrame from a list of tuples
data = [("Alice", 25), ("Bob", 30), ("Cathy", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, schema=columns)
df.show()

  • RDD (Resilient Distributed Dataset): Lower-level API to work with raw data.

# Creating an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

3. Transformations and Actions

In PySpark, there are two types of operations:

  1. Transformations: Lazily evaluated operations that define what you want to do with data (e.g., map, filter).
  2. Actions: Trigger execution and produce results (e.g., collect, count).

# Transformations: map and filter
rdd_transformed = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)

# Action: collect the result
result = rdd_transformed.collect()
print(result)

In the example above, map and filter are transformations, and collect is an action that triggers the computation.

DataFrames also support a wide range of transformations.

# Selecting specific columns
df.select("Name").show()

# Filtering rows based on a condition
df.filter(df["Age"] > 26).show()
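Aggregations are another common DataFrame transformation. Here is a minimal sketch on the same df (the grouping is purely illustrative, since each name appears only once in the sample data):

from pyspark.sql import functions as F

# Group by Name and compute the average age per group
df.groupBy("Name").agg(F.avg("Age").alias("avg_age")).show()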

4. PySpark SQL

You can use SQL queries directly in PySpark using the spark.sql() function.

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

# Run a SQL query against the view
result = spark.sql("SELECT * FROM people WHERE Age > 25")
result.show()

5. Machine Learning with PySpark

PySpark provides a machine learning library called MLlib. The high-level machine learning API is found in the pyspark.ml module, which works with DataFrames.

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Sample data (kept in a separate variable so the earlier df with Name/Age stays available)
data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
df_ml = spark.createDataFrame(data, ["feature", "label"])

# Transform the feature column into a vector
assembler = VectorAssembler(inputCols=["feature"], outputCol="features")
df_transformed = assembler.transform(df_ml)

# Train a linear regression model
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(df_transformed)

# Show model coefficients
print(f"Coefficients: {model.coefficients}, Intercept: {model.intercept}")
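To inspect the fitted model's outputs, you can call model.transform, which appends a prediction column. A brief follow-up using the df_transformed above:

# Generate predictions on the training data
predictions = model.transform(df_transformed)
predictions.select("features", "label", "prediction").show()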

6. Optimization and Best Practices

Efficient use of PySpark requires some best practices for performance optimization:

  1. Use broadcast variables: for small lookup datasets, broadcast them to every node so each worker keeps a local copy instead of shuffling data.
  2. Caching: cache DataFrames or RDDs that are used multiple times to avoid recomputation.
  3. Avoid shuffles: minimize operations that trigger a shuffle (such as groupBy or wide joins).
  4. Use partitioning: split large datasets into partitions based on logical keys (a short sketch of broadcasting and repartitioning follows the caching example below).

# Cache a DataFrame in memory
df_cached = df.cache()
df_cached.count()  # Action to trigger caching
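As promised above, here is a minimal sketch of points 1 and 4. The small_df lookup table, its Dept column, and the repartitioning key are hypothetical examples added for illustration:

from pyspark.sql.functions import broadcast

# Hypothetical small lookup table joined to the people DataFrame
small_df = spark.createDataFrame([("Alice", "HR"), ("Bob", "IT")], ["Name", "Dept"])

# Broadcast join: ships small_df to every executor and avoids shuffling df
joined = df.join(broadcast(small_df), on="Name", how="left")
joined.show()

# Repartition a large DataFrame by a logical key before heavy aggregations
df_partitioned = df.repartition("Name")
print(df_partitioned.rdd.getNumPartitions())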

7. Advanced Features

  • Window Functions: Perform operations over a sliding window of rows (a sketch follows the UDF example below).
  • Joins: Efficiently join two large datasets.
  • User-Defined Functions (UDFs): Apply custom functions to DataFrames.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a Python function
def capitalize_name(name):
    return name.upper()

# Register the UDF
capitalize_udf = udf(capitalize_name, StringType())

# Apply the UDF to the Name column
df_with_capital_names = df.withColumn("Name", capitalize_udf(df["Name"]))
df_with_capital_names.show()
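For the window functions bullet above, here is a minimal sketch. The sales data, the store and amount columns, and the ranking logic are hypothetical, chosen only to illustrate the API:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Hypothetical per-store sales data
sales = spark.createDataFrame(
    [("A", 100), ("A", 200), ("B", 50), ("B", 300)],
    ["store", "amount"],
)

# Rank rows within each store by amount, highest first
w = Window.partitionBy("store").orderBy(sales["amount"].desc())
sales.withColumn("rank", row_number().over(w)).show()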

8. Running PySpark on a Cluster

Once you are familiar with running PySpark locally, you can scale up by running it on a cluster. Here’s how the general architecture works:

  • Driver Program: Coordinates tasks, scheduling, and resource allocation.
  • Workers: Execute the actual computations.
  • Cluster Manager: Allocates resources across the cluster; common options are YARN and Mesos.
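Applications are usually submitted to a cluster with spark-submit. A minimal example, where the master, resource sizes, and script name (my_app.py) are placeholders for your own environment:

spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 2g my_app.py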

PySpark offers a powerful framework to work with large datasets using distributed computing. With its abstractions like RDDs and DataFrames, it simplifies the management of large-scale data, while SQL and MLlib provide intuitive high-level APIs. The keys to success with PySpark are learning how to manipulate data efficiently, apply machine learning models, and optimize performance on a cluster.

I am not an expert in PySpark, just learning it now. I will update this article as I learn new things about PySpark.

Check out this link to know more about me

Let’s get to know each other!

https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc.

https://linktr.ee/aamirp

Get my Podcasts on Spotify

https://lnkd.in/gG7km8G5

Catch me on Medium

https://lnkd.in/gi-mAPxH

Follow me on Instagram

https://lnkd.in/gkf3KPDQ

Udemy (Python Course)

https://lnkd.in/grkbfz_N

YouTube

https://www.youtube.com/@knowledge_engine_from_AamirP

Subscribe to my Channel for more useful content.
