Master the Basics of PySpark: Create, Read, Transform, and Write!

PySpark is your go-to framework for big data processing, and it all starts with a SparkSession. Here's a quick guide to getting started:

1️⃣ Create a SparkSession

A SparkSession is the entry point to PySpark.

from pyspark.sql import SparkSession

# Build (or reuse) the application's SparkSession
spark = SparkSession.builder \
    .appName("PySpark Basics") \
    .getOrCreate()
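
A quick way to confirm the session works is to create a tiny DataFrame in memory. A minimal sketch — the column names and sample rows below are made up purely for illustration:

# Hypothetical sample data, just to verify the session is up
people = spark.createDataFrame(
    [("Alice", 25), ("Bob", 17)],
    ["name", "age"],
)
people.show()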


2️⃣ Read Data

Load your data from various sources like CSV, JSON, or Parquet.

# Read a CSV file; treat the first row as a header and infer column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)
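
The same reader API covers the other formats mentioned above. As a sketch, assuming files named data.json and data.parquet exist at those paths:

# Hypothetical paths; swap in your own files
json_df = spark.read.json("data.json")
parquet_df = spark.read.parquet("data.parquet")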


3️⃣ Transform Data

Apply transformations like filtering, grouping, or adding columns.

# Keep rows where age > 18, then count the rows per city
transformed_df = df.filter(df['age'] > 18).groupBy('city').count()
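
The snippet above covers filtering and grouping; for the "adding columns" part, here is a hedged sketch using withColumn (the is_adult column name is just an example):

from pyspark.sql import functions as F

# Derive a new boolean column from the existing 'age' column
with_flag = df.withColumn("is_adult", F.col("age") > 18)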



4️⃣ Write Data

Save the processed data in your desired format and mode.

  • Write Modes:
  • overwrite: Replace existing data.
  • append: Add to existing data.
  • ignore: Skip if data exists.
  • errorifexists: Throw an error if data exists.

# Overwrite any existing output at the target path
transformed_df.write.mode("overwrite").csv("output_path")
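
CSV is only one option. As a sketch, the same writer can append to a Parquet dataset instead (the output path here is hypothetical); Parquet preserves the schema and is generally more efficient than CSV for downstream Spark jobs:

# Append to an existing Parquet dataset rather than replacing it
transformed_df.write.mode("append").parquet("output_parquet_path")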

HAPPY LEARNING!
