Basics Of Data Cleaning and Manipulation with PySpark
Sushan Kattel
Data Engineer @ Fusemachines | Data Science & Computer Vision | Love to share what I learn
PySpark is a powerful Python library for large-scale data processing and analysis built on top of Apache Spark, making it particularly well suited for big data tasks. It provides a simple and efficient API for working with large datasets, enabling parallel processing across clusters of machines. In this article, we will explore the fundamentals of data cleaning and manipulation with PySpark, starting with the basics. So, let's begin!
Brief Overview of PySpark
Installation and Setup
PySpark can be installed via pip; the pyspark package bundles Spark itself, so no separate Spark download is required for local use. The optional findspark package helps Python locate an existing Spark installation and set up the environment for you.
You can follow this official guide.
Here's a basic installation and setup process:
# Install PySpark package
pip install pyspark
# Install findspark package (optional but recommended)
pip install findspark
# Import and initialize findspark to locate Spark installation
import findspark
findspark.init()
# Import PySpark modules
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("MyPySparkApp") \
    .getOrCreate()
Loading Data
In PySpark, you can load data from various sources such as CSV, JSON, Parquet, databases, etc. Here's how you can load data and perform basic exploration:
Loading Data from Various Sources
# Reading CSV file
df_csv = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
# Reading JSON file
df_json = spark.read.json("path/to/file.json")
# Reading Parquet file
df_parquet = spark.read.parquet("path/to/file.parquet")
Basic Data Exploration Techniques
# Display the schema of the DataFrame
df_csv.printSchema()
# Display the first few rows of the DataFrame
df_csv.show()
# Get summary statistics for numerical columns
df_csv.describe().show()
Additional Data Loading Techniques
View complete article at: https://medium.com/@sushankattel/basics-of-data-cleaning-and-manipulation-with-pyspark-2dbb5b7fd413