How to use PySpark on your computer
This may be repetitive for some users, but I have found that getting started with Apache Spark (this post will focus on PySpark) on your local machine is a little difficult for most people.
This tutorial assumes a Unix-like environment (Linux or macOS), since the commands below rely on bash or zsh, tar, and paths like /opt.
I will assume you know what Apache Spark is, and what PySpark is too, but if you have questions, feel free to ask them below.
The $ symbol means the command should be run in your shell (but don't copy the symbol itself).
Running PySpark in Jupyter
1. Install Jupyter Notebook
$ pip install jupyter
2. Install PySpark
Make sure you have Java 8 or higher installed on your computer. Of course, you will also need Python; I recommend Python 3.5 or higher from the Anaconda distribution.
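If you are not sure which versions you have installed, you can check them from the shell:
$ java -version
$ python --version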
Now visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly. If you want Hive support or other fancy features, you will have to build your own Spark distribution (see the Build Spark documentation).
Extract the archive and move it to your /opt folder:
$ tar -xzf spark-2.2.0-bin-hadoop2.6.tgz
$ mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0
Create a symbolic link:
$ ln -s /opt/spark-2.2.0 /opt/spark
Finally, tell your bash (or zsh, etc.) where to find Spark. To do so, configure your environment variables by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
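Before moving on, you can reload your shell configuration and check that Spark is found (spark-submit ships with the distribution you just unpacked):
$ source ~/.bashrc
$ spark-submit --version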
Now to run PySpark in Jupyter you'll need to update the PySpark driver environment variables. Just add these lines to your ~/.bashrc (or ~/.zshrc) file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Restart your terminal (or just source your ~/.bashrc or ~/.zshrc) and launch PySpark:
$ pyspark
Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks: Python [default]’. And voilà, you have a SparkContext and SQLContext (or just a SparkSession for Spark 2.x) on your machine and can run PySpark in your notebooks (run a small example like the one below to test your environment).
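For example, here is a minimal sanity check you can paste into a cell; it is just a sketch, assuming the notebook was started with the pyspark command so that sc is already defined (and spark as well on Spark 2.x):
# sc (SparkContext) is created for you by the pyspark launcher
rdd = sc.parallelize(range(100))
print(rdd.map(lambda x: x * 2).sum())  # should print 9900
# On Spark 2.x you also get a ready-made SparkSession called spark
spark.range(10).show()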
Running PySpark in your favorite IDE
Sometimes you need a full IDE to create more complex code, and PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.
To install findspark just type:
$ pip install findspark
Then, in your IDE (I use PyCharm), initialize PySpark by calling:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
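As a quick check that the context above is actually working, you can run a tiny job and print the Spark version (a minimal sketch, assuming the snippet above has already created sc):
print(sc.version)                       # Spark version, e.g. 2.2.0
print(sc.parallelize([1, 2, 3]).sum())  # should print 6
sc.stop()                               # release the context when you're done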
And that's it. Pretty simple, right? Here is a full example of a standalone application to test PySpark locally (using the configuration explained above):
import findspark
findspark.init("/opt/spark")  # point findspark at the symlink created earlier

import random
from pyspark import SparkContext

sc = SparkContext(appName="EstimatePi")

def inside(p):
    # p is the RDD element (unused); draw a random point in the unit square
    # and test whether it falls inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()
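Save the script (for example as estimate_pi.py; the name is just an illustration) and run it with your regular Python interpreter, since findspark takes care of locating Spark:
$ python estimate_pi.py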
If you have anything to add, or any questions, just ask and I'll try to help you.
Cheers!