Pandas API on Apache Spark - Part 2: Hello World
Pandas API on Apache Spark brings the familiar Python pandas API on top of the distributed Spark framework. This combination allows Python developers to write code in their favorite pandas API while getting all the performance and distribution benefits of Spark. This marriage of API and platform is one of the biggest improvements landing in Apache Spark in recent times. The feature will be available in Spark 3.2.
In this series of posts, we will discuss different aspects of this integration. This is the second post in the series, where we write our first hello world example. You can access the other posts in the series here.
Setup
Running the pandas API on Spark needs Spark 3.2. At the time this blog is written, Spark 3.2 is still in development, so to run these examples you need to build the Spark tarball from source. Once 3.2 is released as a stable version, you can run these examples like any other PySpark program.
You can find more details on how to build Spark from source at the link below.
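Once you have a build, you can sanity-check the Spark version from Python. This is a minimal sketch, assuming the `pyspark` package from your 3.2 build is on the Python path:

```python
# Check that the running Spark is 3.2 or later, since the
# pandas API on Spark ships starting with Spark 3.2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # expect something like "3.2.0"
```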
You also need to install the following libraries in your Python virtual environment:
1. Pandas >= 0.23
2. PyArrow >= 1.0
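With the environment in place, a quick way to verify everything is wired up is to import these libraries and create a small pandas-on-Spark DataFrame. This is a minimal sketch; the column names and values are just illustrative:

```python
# Verify the dependency versions installed in the virtual environment.
import pandas
import pyarrow

print(pandas.__version__)   # should be >= 0.23
print(pyarrow.__version__)  # should be >= 1.0

# In Spark 3.2, the pandas API on Spark is exposed as pyspark.pandas.
import pyspark.pandas as ps

# Create a small pandas-on-Spark DataFrame; the API looks like pandas,
# but the data lives in Spark and operations run distributed.
psdf = ps.DataFrame({"language": ["python", "scala"], "rank": [1, 2]})
print(psdf.head())
```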