Master the Basics of PySpark: Create, Read, Transform, and Write!
Hemavathi .P
Data Engineer @IBM | 3+ years experience | Hadoop | HDFS | SQL | Sqoop | Hive | PySpark | AWS | AWS Glue | AWS EMR | AWS Redshift | S3 | Lambda
PySpark is your go-to framework for big data processing, and it all starts with a SparkSession. Here's a quick guide to getting started:
1️⃣ Create a SparkSession
A SparkSession is the entry point to PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Basics") \
    .getOrCreate()
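Once the session exists, you can sanity-check it or adjust a common setting. A minimal sketch, assuming a small local job (the config key spark.sql.shuffle.partitions is just one typical example):

# Confirm the Spark version the session is running on
print(spark.version)

# Lower the shuffle partition count for small local datasets
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Stop the session when your job finishes
# spark.stop()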
2️⃣ Read Data
Load your data from various sources like CSV, JSON, or Parquet.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
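The same reader API handles the other formats mentioned above. A quick sketch, where data.json and data.parquet are placeholder paths for illustration:

# JSON: schema is inferred from the records
json_df = spark.read.json("data.json")

# Parquet: schema is stored in the file itself, so no inference is needed
parquet_df = spark.read.parquet("data.parquet")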
3️⃣ Transform Data
Apply transformations like filtering, grouping, or adding columns.
transformed_df = df.filter(df['age'] > 18).groupBy('city').count()
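Adding a derived column works the same way as filtering and grouping. A small sketch, assuming the CSV from step 2 contains age and city columns:

from pyspark.sql import functions as F

# Add a boolean column, then filter and aggregate (transformations stay lazy)
with_flag = df.withColumn("is_adult", F.col("age") > 18)
adults_by_city = with_flag.filter(F.col("is_adult")).groupBy("city").count()

# show() is the action that actually triggers execution
adults_by_city.show()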
4️⃣ Write Data
Save the processed data in your desired format and mode.
transformed_df.write.mode("overwrite").csv("output_path")
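Writing Parquet or partitioning the output follows the same writer pattern. A brief sketch, with the output paths as placeholders:

# Parquet is the usual choice when downstream jobs are also Spark
transformed_df.write.mode("overwrite").parquet("output_parquet_path")

# Optionally partition the output by a column such as city
transformed_df.write.mode("overwrite").partitionBy("city").parquet("output_partitioned_path")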
HAPPY LEARNING!