PySpark: Apache Spark, the Pythonic Way!
What is Apache Spark?
Apache Spark is an open-source, distributed processing system used for workloads that involve large volumes of data. It uses in-memory caching and optimized query execution to run fast queries against data of virtually any size. Simply put, Spark is a fast, general-purpose engine for large-scale data processing.
Spark helps unify data and AI by simplifying data preparation at massive scale across data collected from many sources. It also provides a rich, stable set of APIs for both data engineering and data science work, along with integration with popular libraries such as TensorFlow, PyTorch, R and scikit-learn.
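For instance, keeping a frequently queried dataset in memory takes a single call. Below is a minimal sketch of that idea; the session name, app name and DataFrame here are illustrative, not from the article.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("caching-demo").getOrCreate()
df = spark.range(1_000_000)            # small demo DataFrame of ids 0..999999
df.cache()                             # mark it for in-memory caching
df.count()                             # the first action materializes the cache
df.filter(df.id % 2 == 0).count()      # later queries reuse the cached data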
Why Spark for Big Data?
Spark suits Big Data Analytics because it uses cluster computing to store and analyse large-scale data with ease. It can pool resources from many processors linked together and apply them to the analysis of the collected data. The solution is scalable: if more computational power is needed, you simply add more processors to the system. With distributed storage, the huge datasets gathered for Big Data analysis can be spread across many smaller individual parts, such as physical hard discs. This noticeably speeds up read/write operations, because the "head" that reads information from a disc has less physical distance to travel over the disc surface. As with processing power, more storage can be added when needed, and because Spark runs on commonly available hardware (standard computer hard discs), infrastructure costs stay low, which benefits organizations.
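To illustrate the scalability point, the hedged sketch below shows that the same application can target a single laptop or a whole cluster just by changing the master URL; the cluster address and app name are hypothetical placeholders.
from pyspark import SparkConf
# Run on the local machine, using all available cores
laptop_conf = SparkConf().setAppName("analysis").setMaster("local[*]")
# Run on a standalone cluster instead (placeholder address); the job code itself does not change
# cluster_conf = SparkConf().setAppName("analysis").setMaster("spark://head-node:7077")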
What is PySpark?
PySpark is the Python interface for Apache Spark. With PySpark, you can write Spark applications using the provided Python APIs, and the PySpark shell lets you analyze data interactively in a distributed environment.
Using PySpark, you can also easily create and work with RDDs (Resilient Distributed Datasets) directly from Python. Numerous features make PySpark an excellent framework for working with huge datasets, whether you need to run heavy computations on them or simply analyze them, which is why many practitioners in Big Data Analytics are shifting towards it.
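As a small taste of what working with RDDs in Python looks like, here is a hedged word-count sketch; it assumes the SparkContext sc created in the demo below.
lines = sc.parallelize(["spark is fast", "pyspark is spark in python"])
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.collect())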
A Little Demo on PySpark:
Prerequisites:
- Java Version 8 or 11
- Python Version 3+
- Spark Version 2.4+ (or 3.x)
- Hadoop Version 2.7+
- Anaconda Distribution Installed
Process:
Launch Jupyter Notebook and Create a new Notebook
Install the findspark library
pip install findspark
Initialize findspark using the following code snippet
import findspark
findspark.init()     # make the local Spark installation visible to Python
findspark.find()     # optionally, confirm the Spark installation path
import pyspark
Create the Spark context and start the Spark session
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('SparkApp').setMaster("local")   # run Spark on the local machine
sc = SparkContext(conf=conf)                                   # low-level entry point (RDDs)
spark = SparkSession(sc)                                       # high-level entry point (DataFrames/SQL)
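As a side note (not part of the original snippet), recent PySpark releases also let you create the session directly through the builder API, which reuses an existing session if one is already running; a minimal alternative sketch:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("SparkApp")
         .master("local[*]")        # use all local cores
         .getOrCreate())            # reuse an existing session if available
sc = spark.sparkContext             # the underlying SparkContext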
Now, let us perform a simple operation in PySpark that converts a list of numbers into their respective cube values.
numeric_val = sc.parallelize([1, 2, 3, 4])       # distribute the Python list as an RDD
numeric_val.map(lambda x: x * x * x).collect()   # cube each element and collect the results
The output will look like: [1, 8, 27, 64]
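The same computation can also be expressed with the DataFrame API; here is a short optional sketch, where the column name n is purely illustrative.
from pyspark.sql.functions import col
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["n"])
df.select((col("n") ** 3).cast("long").alias("cube")).show()   # pow returns a double, so cast back to long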
Now, let us stop the Spark Session
sc.stop()
Conclusion:
In this article, we walked through one of the finest technologies in Big Data, Apache Spark, explored its Python support through PySpark, and finally got our hands dirty by implementing a basic Spark operation in an Anaconda environment.