How to use PySpark on your computer

This may be repetitive for some users, but I found that getting started with Apache Spark (this will focus on PySpark) on your local machine is a little difficult for most people.

This tutorial targets Unix-like systems (Linux or macOS); the commands below assume a bash-compatible shell.

I will assume you know what Apache Spark is, and what PySpark is too, but if you have questions, feel free to ask them below.

The $ symbol means the command is run in the shell (but don't copy the symbol).

Running PySpark in Jupyter

1. Install Jupyter Notebook

$ pip install jupyter

2. Install PySpark

Make sure you have Java 8 or higher installed on your computer. Of course, you will also need Python (I recommend Python 3.5 or higher, for example from the Anaconda distribution).
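
If you are not sure what is installed, you can check both from the shell (the exact version output will of course vary with your setup):

$ java -version
$ python --version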

Now visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly. If you want Hive support or other fancy features, you will have to build your Spark distribution on your own (see Build Spark).

Unpack it and move it to your /opt folder (you may need sudo to write to /opt):

$ tar -xzf spark-2.2.0-bin-hadoop2.6.tgz
$ mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0

Create a symbolic link, so you can later point it at a newer Spark version without changing your configuration:

$ ln -s /opt/spark-2.2.0 /opt/spark

Finally, tell your shell (bash, zsh, etc.) where to find Spark. To do so, configure your $PATH by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

Now to run PySpark in Jupyter you'll need to update the PySpark driver environment variables. Just add these lines to your ~/.bashrc (or ~/.zshrc) file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
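
To pick up the new variables without opening a fresh terminal, you can source the file (use ~/.zshrc instead if that is your shell):

$ source ~/.bashrc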

Restart your terminal (or just source your shell configuration, as shown above) and launch PySpark:

$ pyspark

Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Python [default]’. And voilà, you have a SparkContext and SQLContext (or just a SparkSession for Spark 2.x and later) on your computer and can run PySpark in your notebooks. Run a couple of examples to test your environment, for instance the snippet below.
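
Here is a minimal sanity check you can paste into the first cell; it assumes Spark 2.x, where the notebook session already provides spark and sc for you:

# Quick sanity check: sum the numbers 0..99 with an RDD
rdd = sc.parallelize(range(100))
print(rdd.sum())  # should print 4950

# And a tiny DataFrame through the SparkSession entry point
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])
df.show()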

Running PySpark in your favorite IDE

Sometimes you need a full IDE to create more complex code, and PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.
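
Under the hood this boils down to something like the following sketch (the paths assume the /opt/spark symlink from above, and the py4j zip name changes between Spark versions, so in practice just let findspark figure it out):

import glob
import os
import sys

# Rough sketch of what findspark does: point SPARK_HOME at the installation
# and put Spark's bundled Python sources (pyspark and py4j) on sys.path.
SPARK_HOME = "/opt/spark"
os.environ["SPARK_HOME"] = SPARK_HOME
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*-src.zip"))[0])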

To install findspark just type:

$ pip install findspark

Then, in your IDE (I use PyCharm), initialize PySpark by calling:

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName="myAppName")
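
If you are on Spark 2.x you can use the newer SparkSession entry point instead; here is a minimal sketch (the master and app names are just examples):

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("myAppName") \
    .getOrCreate()

sc = spark.sparkContext  # the underlying SparkContext, if you still need it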

And that's it. Pretty simple, right? Here is a full example of a standalone application to test PySpark locally (using the configuration explained above):

import findspark
findspark.init("/opt/spark")  # point findspark at the Spark installation
import random

from pyspark import SparkContext

sc = SparkContext(appName="EstimatePi")

def inside(p):
    # Draw a random point in the unit square and check whether it falls
    # inside the quarter circle of radius 1 (the argument is unused).
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1000000

# The fraction of points inside the quarter circle approximates pi/4.
count = sc.parallelize(range(0, NUM_SAMPLES)) \
             .filter(inside).count()

print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

sc.stop()
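
Assuming you saved the script as something like estimate_pi.py (the file name is up to you), you can run it either as a plain Python script or through spark-submit; the latter works even without findspark, since spark-submit sets up the environment itself:

$ python estimate_pi.py
$ spark-submit estimate_pi.py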

If you have anything to add, or any questions, just ask and I'll try to help you.

Cheers!
