How to use PySpark on your computer
This may be repetitive for some users, but I have found that getting started with Apache Spark (this post will focus on PySpark) on your local machine is a little difficult for most people.
This tutorial assumes a Unix-like environment (Linux or macOS), since the commands below rely on bash or zsh, tar, and paths like /opt.
I will assume you know what Apache Spark is, and what PySpark is too, but if you have questions, feel free to ask them below.
The $ symbol means the command should be run in your shell (but don't copy the symbol itself).
Running PySpark in Jupyter
1. Install Jupyter Notebook
$ pip install jupyter
2. Install PySpark
Make sure you have Java 8 or higher installed on your computer. Of course, you will also need Python; I recommend Python 3.5 or higher from the Anaconda distribution.
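If you are not sure which versions you have installed, you can check them from the shell:
$ java -version
$ python --version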
Now visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly. If you want Hive support or other fancy features, you will have to build your own Spark distribution (see the Build Spark documentation).
Extract the archive and move it to your /opt folder:
$ tar -xzf spark-2.2.0-bin-hadoop2.6.tgz
$ mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0
Create a symbolic link:
$ ln -s /opt/spark-2.2.0 /opt/spark
Finally, tell your bash (or zsh, etc.) where to find Spark. To do so, configure your environment variables by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
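Before moving on, you can reload your shell configuration and check that Spark is found (spark-submit ships with the distribution you just unpacked):
$ source ~/.bashrc
$ spark-submit --version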
Now to run PySpark in Jupyter you'll need to update the PySpark driver environment variables. Just add these lines to your ~/.bashrc (or ~/.zshrc) file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Restart your terminal (or just source your ~/.bashrc or ~/.zshrc) and launch PySpark:
$ pyspark
Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks: Python [default]’. And voilà, you have a SparkContext and SQLContext (or just a SparkSession for Spark 2.x) on your machine and can run PySpark in your notebooks (run a small example like the one below to test your environment).
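For example, here is a minimal sanity check you can paste into a cell; it is just a sketch, assuming the notebook was started with the pyspark command so that sc is already defined (and spark as well on Spark 2.x):
# sc (SparkContext) is created for you by the pyspark launcher
rdd = sc.parallelize(range(100))
print(rdd.map(lambda x: x * 2).sum())  # should print 9900
# On Spark 2.x you also get a ready-made SparkSession called spark
spark.range(10).show()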
Running PySpark in your favorite IDE
Sometimes you need a full IDE to create more complex code, and PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.
To install findspark just type:
$ pip install findspark
Then, in your IDE (I use PyCharm), initialize PySpark by calling:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
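As a quick check that the context above is actually working, you can run a tiny job and print the Spark version (a minimal sketch, assuming the snippet above has already created sc):
print(sc.version)                       # Spark version, e.g. 2.2.0
print(sc.parallelize([1, 2, 3]).sum())  # should print 6
sc.stop()                               # release the context when you're done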
And that's it. Pretty simple, right? Here is a full example of a standalone application to test PySpark locally (using the configuration explained above):
import findspark
findspark.init("/opt/spark")  # point findspark at the symlink created earlier

import random
from pyspark import SparkContext

sc = SparkContext(appName="EstimatePi")

def inside(p):
    # p is the RDD element (unused); draw a random point in the unit square
    # and test whether it falls inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()
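Save the script (for example as estimate_pi.py; the name is just an illustration) and run it with your regular Python interpreter, since findspark takes care of locating Spark:
$ python estimate_pi.py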
If you have anything to add, or any questions, just ask and I'll try to help you.
Cheers!