Question
I have a Docker container that's running Apache Spark with a master and a slave worker. I'm attempting to submit a job from a Jupyter notebook on the host machine. See below:
# Init
!pip install findspark
import findspark
findspark.init()
# Context setup
from pyspark import SparkConf, SparkContext
# Docker container is exposing port 7077
conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)
sc
# Execute step
import random
num_samples = 1000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
The execute step shows the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.17.0.2, executor 0):
java.io.IOException: Cannot run program "/Users/omar/anaconda3/bin/python": error=2, No such file or directory
It looks to me like the command is trying to run the Spark job locally when it should be sending it to the Spark master specified in the previous steps. Is this not possible through a Jupyter notebook?
My container is based on https://hub.docker.com/r/p7hb/docker-spark/, but I installed Python 3.6 under /usr/bin/python3.6.
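For reference, a rough sketch of how such a container might be started so the master is reachable from the host. The image tag, hostname, and startup details are assumptions (the image's own docs apply); 7077 is the standalone master's RPC port and 8080 its web UI:
docker run -d -h spark-master -p 7077:7077 -p 8080:8080 p7hb/docker-spark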
Answer 1:
I had to do the following before I created the SparkContext:
import os
# Path on master/worker where Python is installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'
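Putting it together, a minimal sketch of how this fix slots into the notebook from the question (assuming, as above, the master is reachable on localhost:7077 and the workers' interpreter lives at /usr/bin/python3.6):
import os
import findspark
# Tell the executors which Python to run; must be set before the SparkContext is created.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'
findspark.init()
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)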
Some research suggested adding this to /usr/local/spark/conf/spark-env.sh
via:
export PYSPARK_PYTHON='/usr/bin/python3.6'
but that didn't work for me.
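As an alternative not covered in the original answer, the same interpreter path can also be passed through Spark's spark.pyspark.python configuration property (available since Spark 2.1) instead of the environment variable, for example:
conf = SparkConf() \
    .setAppName('test') \
    .setMaster('spark://localhost:7077') \
    .set('spark.pyspark.python', '/usr/bin/python3.6')  # executor-side Python
sc = SparkContext(conf=conf)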
Source: https://stackoverflow.com/questions/44788720/how-to-run-pyspark-jobs-from-a-local-jupyter-notebook-to-a-spark-master-in-a-doc