PySpark: Apache Spark, the Pythonic Way!
What is Apache Spark?
Apache Spark is an open-source, distributed processing system used for workloads that involve large volumes of data. It uses in-memory caching and optimized query execution to run fast queries against data of virtually any size. Simply put, Spark is a fast, general-purpose engine for large-scale data processing.
Spark helps unify data and AI by simplifying data preparation at massive scale across data collected from many sources. It also provides a rich, stable set of APIs for both data engineering and data science work, along with integration with popular libraries such as TensorFlow, PyTorch, R and scikit-learn.
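For instance, keeping a frequently queried dataset in memory takes a single call. Below is a minimal sketch of that idea; the session name, app name and DataFrame here are illustrative, not from the article.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("caching-demo").getOrCreate()
df = spark.range(1_000_000)            # small demo DataFrame of ids 0..999999
df.cache()                             # mark it for in-memory caching
df.count()                             # the first action materializes the cache
df.filter(df.id % 2 == 0).count()      # later queries reuse the cached data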
Why Spark for Big Data?
Spark suits Big Data Analytics because it uses cluster computing to store and analyse large-scale data with ease. It can pool resources from many processors linked together and apply them to the analysis of the collected data. The solution is scalable: if more computational power is needed, you simply add more processors to the system. With distributed storage, the huge datasets gathered for Big Data analysis can be spread across many smaller individual parts, such as physical hard discs. This noticeably speeds up read/write operations, because the "head" that reads information from a disc has less physical distance to travel over the disc surface. As with processing power, more storage can be added when needed, and because Spark runs on commonly available hardware (standard computer hard discs), infrastructure costs stay low, which benefits organizations.
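To illustrate the scalability point, the hedged sketch below shows that the same application can target a single laptop or a whole cluster just by changing the master URL; the cluster address and app name are hypothetical placeholders.
from pyspark import SparkConf
# Run on the local machine, using all available cores
laptop_conf = SparkConf().setAppName("analysis").setMaster("local[*]")
# Run on a standalone cluster instead (placeholder address); the job code itself does not change
# cluster_conf = SparkConf().setAppName("analysis").setMaster("spark://head-node:7077")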
What is PySpark?
PySpark is the Python interface for Apache Spark. With PySpark, you can write Spark applications using the provided Python APIs, and the PySpark shell lets you analyze data interactively in a distributed environment.
Using PySpark, you can also easily create and work with RDDs (Resilient Distributed Datasets) directly from Python. Numerous features make PySpark an excellent framework for working with huge datasets, whether you need to run heavy computations on them or simply analyze them, which is why many practitioners in Big Data Analytics are shifting towards it.
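As a small taste of what working with RDDs in Python looks like, here is a hedged word-count sketch; it assumes the SparkContext sc created in the demo below.
lines = sc.parallelize(["spark is fast", "pyspark is spark in python"])
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.collect())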
A Little Demo on PySpark:
Prerequisites:
- Java Version 8 or 11
- Python Version 3+
- Spark Version 2.4+ (or 3.x)
- Hadoop Version 2.7+
- Anaconda Distribution Installed
Process:
Launch Jupyter Notebook and Create a new Notebook
Install the findspark library
pip install findspark
Initialize findspark using the following code snippet
import findspark
findspark.init()     # make the local Spark installation visible to Python
findspark.find()     # optionally, confirm the Spark installation path
import pyspark
Create the Spark context and start the Spark session
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('SparkApp').setMaster("local")   # run Spark on the local machine
sc = SparkContext(conf=conf)                                   # low-level entry point (RDDs)
spark = SparkSession(sc)                                       # high-level entry point (DataFrames/SQL)
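As a side note (not part of the original snippet), recent PySpark releases also let you create the session directly through the builder API, which reuses an existing session if one is already running; a minimal alternative sketch:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("SparkApp")
         .master("local[*]")        # use all local cores
         .getOrCreate())            # reuse an existing session if available
sc = spark.sparkContext             # the underlying SparkContext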
Now, let us perform a simple operation in PySpark that converts a list of numbers into their respective cube values.
numeric_val = sc.parallelize([1, 2, 3, 4])       # distribute the Python list as an RDD
numeric_val.map(lambda x: x * x * x).collect()   # cube each element and collect the results
The output will look like: [1, 8, 27, 64]
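The same computation can also be expressed with the DataFrame API; here is a short optional sketch, where the column name n is purely illustrative.
from pyspark.sql.functions import col
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["n"])
df.select((col("n") ** 3).cast("long").alias("cube")).show()   # pow returns a double, so cast back to long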
Now, let us stop the Spark Session
sc.stop()
Conclusion:
In this article, we walked through one of the finest technologies in Big Data, Apache Spark, explored its Python support through PySpark, and finally got our hands dirty by implementing a basic Spark operation in an Anaconda environment.