Mastering ETL Processes Using PySpark
Dr. Fatma Ben Mesmia Chaabouni
Assistant Professor in Computer Science @ CU Ulster University, Qatar | Ph.D. in CS | MSc & B.Sc. in CS | NLP, AI, Data Analytics, and Blockchain researcher | MBA mentor | Tunisian AI Society Member
Today's tutorial uses the Titanic dataset to demonstrate ETL (Extract, Transform, Load) processes using PySpark. This guide will help you set up your environment, extract data from a CSV file, transform the data, and load the transformed data back into a file. Each step is explained with well-commented code.
Step 1: Environment Setup and SparkSession Creation
Install PySpark
First, you need to install PySpark if you haven't already:
pip install pyspark
Create a SparkSession
The SparkSession is the entry point to programming Spark with the Dataset and DataFrame API. We start by creating a SparkSession:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName('Titanic ETL Process').getOrCreate()
Step 2: Data Extraction
Read Data from CSV
We will read the Titanic dataset from a CSV file. Ensure the dataset is in your working directory or provide the correct path to the file.
# Read Titanic dataset from CSV
df = spark.read.csv('titanic.csv', inferSchema=True, header=True)
# Show the schema to verify the data has been loaded correctly
df.printSchema()
# Show the first few rows of the data frame
df.show(5)
Step 3: Data Transformation
Selecting Columns
Since our dataset only contains PassengerId and Survived, we can skip this step.
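For reference, this is what column selection looks like on the full Titanic CSV. It is a minimal sketch; the extra column names ('Pclass', 'Sex', 'Age') are assumptions about your copy of the file and may not exist in a trimmed version.
# Minimal sketch: keep only the columns of interest
# ('Pclass', 'Sex', and 'Age' are assumed to exist in the full dataset)
df_selected = df.select('PassengerId', 'Survived', 'Pclass', 'Sex', 'Age')
df_selected.show(5)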
Filtering Data
We will filter the data to include only passengers who survived.
# Filter passengers who survived
df_survived = df.filter(df['Survived'] == 1)
# Show the first few rows of the dataframe
df_survived.show(5)
Adding New Columns
We can add a new column that expresses whether the passenger survived in a more descriptive way.
from pyspark.sql.functions import when
# Create a new column 'SurvivalStatus' based on the 'Survived' column
df = df.withColumn('SurvivalStatus', when(df['Survived'] == 1, 'Survived')
.otherwise('Not Survived'))
# Show the first few rows of the dataframe
df.show(5)
Renaming Columns
We can rename columns with withColumnRenamed if needed. In the previous step we added a new, more descriptive 'SurvivalStatus' column rather than renaming 'Survived', so no rename is strictly required here.
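For completeness, here is a minimal sketch of a rename, applied to a separate DataFrame so the original 'Survived' column stays available for the grouping step below.
# Minimal sketch: rename 'Survived' to 'SurvivedFlag' on a copy
# so the original column names remain untouched for later steps
df_renamed = df.withColumnRenamed('Survived', 'SurvivedFlag')
df_renamed.printSchema()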
Grouping and Aggregating Data
We group the data by Survived and count the number of passengers in each group.
# Group by 'Survived' and count the number of passengers in each group
df_grouped = df.groupBy('Survived').count()
# Show the grouped data
df_grouped.show()
Step 4: Handling Missing Values
Since we have a limited dataset, we will assume there are no missing values to handle.
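If your copy of the dataset does contain nulls (the full Titanic data typically has missing Age and Cabin values), the sketch below shows how to count them and two common ways to handle them. The 'Age' and 'Cabin' names are assumptions about the full file.
from pyspark.sql.functions import col, sum as spark_sum
# Count null values per column to see what actually needs handling
null_counts = df.select([spark_sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])
null_counts.show()
# Option 1: drop rows containing any missing values
df_no_nulls = df.dropna()
# Option 2: fill missing values with defaults ('Age' and 'Cabin' are assumed columns)
# df_filled = df.fillna({'Age': 0, 'Cabin': 'Unknown'})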
Step 5: Data Type Conversion
No specific data type conversion is needed for this limited dataset.
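Should you need a conversion later, cast does the work. A minimal sketch, assuming you want 'Survived' as a string (the commented 'Age' cast assumes that column exists in your file):
from pyspark.sql.functions import col
# Minimal sketch: cast 'Survived' from integer to string
df_casted = df.withColumn('Survived', col('Survived').cast('string'))
# df_casted = df_casted.withColumn('Age', col('Age').cast('int'))  # assumed column
df_casted.printSchema()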
Step 6: Advanced Data Manipulations
Using SQL Queries
We can run SQL queries on the DataFrame by creating a temporary view.
# Create a temporary view
df.createOrReplaceTempView('titanic')
# Run an SQL query to select passengers who survived
sql_df = spark.sql('SELECT * FROM titanic WHERE Survived = 1')
# Show the result of the SQL query
sql_df.show(5)
Step 7: Data Loading
Finally, we write the transformed data out to a CSV file.
Writing to CSV
# Write the transformed data to a new CSV file
df.write.csv('titanic_transformed.csv', header=True)
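Keep in mind that Spark writes the output as a directory of part files and raises an error if the path already exists. A minimal sketch of a more forgiving variant, coalescing to a single part file, overwriting any previous run, and stopping the session when done:
# Coalesce to one partition so the output directory holds a single part file,
# and overwrite the target path if it already exists
df.coalesce(1).write.mode('overwrite').csv('titanic_transformed.csv', header=True)
# Stop the SparkSession once the ETL job is finished
spark.stop()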
Feel free to connect with me on LinkedIn for more insights into big data and PySpark!
Happy ETL-ing!