Today, we will go through how to set up IPython Notebook with PySpark.
1. Please install Spark based on my previous post, and remember to add the two required lines to ~/.bashrc and source it.
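For reference, here is a minimal sketch of those ~/.bashrc additions, assuming Spark is installed at /usr/local/spark (see the previous post for the exact lines for your setup):

```shell
# Assumed install location from the previous post -- adjust if yours differs.
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
```

After editing, run `source ~/.bashrc` so the current shell picks up the variables.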
2. Run pyspark from /usr/local/spark.
3. Install IPython through the Synaptic Package Manager. Here’s a good tutorial.
4. Some packages might be missing; install them with:
sudo pip install tornado --upgrade
sudo pip install jsonschema
5. Open ipython
6. Here’s a good tutorial talking about setting up IPython with Spark.
7. Create a Spark profile:
ipython profile create spark
8. Create a file at ~/.ipython/profile_spark/startup/00-pyspark-setup.py and add the following:
import os
import sys

# Configure the environment
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/usr/local/spark'

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

# Add PySpark/py4j to the Python path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
This lets the notebook use PySpark directly, without the interactive PySpark shell showing up in the terminal.
9. Start IPython again with the Spark profile:
ipython notebook --profile spark
10. Add the IPython environment variable to ~/.bashrc and source it:
export IPYTHON_OPTS="notebook --pylab inline"
Without the above statement, we may encounter a “py4j.java_gateway cannot be found” error.
11. Open a new Python 2 notebook and try the following commands, pressing Alt + Enter to execute each cell:
from pyspark import SparkContext
sc = SparkContext("local", "pyspark")
12. Let’s try the typical Pi example in Python (refer to this).
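Here is a sketch of that Pi estimate, written in plain Python so it runs anywhere; in the notebook you would distribute the sampling through the sc created above, as shown in the comment (the NUM_SAMPLES name and sample count are illustrative, not from the original post):

```python
import random

# Monte Carlo estimate of Pi: sample random points in the unit square
# and count how many land inside the quarter circle.
NUM_SAMPLES = 100000

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# In PySpark, the same logic distributes across workers:
#   count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
count = sum(1 for i in range(NUM_SAMPLES) if inside(i))

print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
```

With 100,000 samples the estimate typically lands within a few hundredths of 3.14159.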
The final result should look like the following image, with the Pi value displayed.
Congratulations! Now you know how to use IPython notebook with Spark!