So far I have run Spark only on Linux machines and VMs (bridged networking), but now I am interested in utilizing more computers as slaves. It would be handy to distribute a
I'm running three different types of Docker containers on my machine, with the intention of deploying them into the cloud once all the software we need has been added to them: Master, Worker and Jupyter notebook (with Scala, R and Python kernels).
Here are my observations so far:
Master:
I set the container's hostname and point Spark at it with -h "dockerhost-master" -e SPARK_MASTER_IP="dockerhost-master". I couldn't find a way to make Akka bind against the container's IP but accept messages against the host IP. I know it's possible with Akka 2.4, but maybe not with Spark.
I'm also passing in -e SPARK_LOCAL_IP="${HOST_IP}", which causes the Web UI to bind against that address instead of the container's IP, but the Web UI works all right either way.
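Putting the master's flags together, the launch looks roughly like the sketch below. The image name (myrepo/spark-master), the HOST_IP value and the published ports are my own placeholders, not something prescribed by Spark or Docker:

    HOST_IP=192.168.99.100   # IP of the Docker host / VM (placeholder)

    docker run -d --name spark-master \
      -h "dockerhost-master" \
      -e SPARK_MASTER_IP="dockerhost-master" \
      -e SPARK_LOCAL_IP="${HOST_IP}" \
      -p 7077:7077 \
      -p 8080:8080 \
      myrepo/spark-master

7077 is the standalone master's default RPC port and 8080 its web UI; publishing them lets the worker and the notebook reach the master through ${HOST_IP}.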
Worker:
I pass a custom hostname via --host to Spark's org.apache.spark.deploy.worker.Worker class. It can't be the same as the master's or the Akka cluster will not work: -h "dockerhost-worker"
I use Docker's add-host so the container is able to resolve the hostname to the master's IP: --add-host dockerhost-master:${HOST_IP}
The worker then connects to the master at spark://dockerhost-master:7077
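A corresponding worker launch could look like this; the image name, the /opt/spark install path and the published worker UI port are assumptions on my part:

    # 8081 is the worker web UI's default port
    docker run -d --name spark-worker \
      -h "dockerhost-worker" \
      --add-host dockerhost-master:${HOST_IP} \
      -p 8081:8081 \
      myrepo/spark-worker \
      /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
        --host dockerhost-worker spark://dockerhost-master:7077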
Jupyter:
Needs the same add-host setting to be able to resolve the master's hostname. The SparkContext lives in the notebook, and that's where the web UI of the Spark application is started, not on the master. By default it binds to the internal IP address of the Docker container. To change that I had to pass in -e SPARK_PUBLIC_DNS="${VM_IP}" -p 4040:4040. Subsequent applications from the notebook would be on 4041, 4042, etc.
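For the notebook container that translates into something like the following sketch; the image name, the VM_IP value and the Jupyter port mapping are again placeholders of mine:

    # 8888 is Jupyter's default port; 4040/4041 are the first two Spark application UIs
    docker run -d --name spark-notebook \
      --add-host dockerhost-master:${HOST_IP} \
      -e SPARK_PUBLIC_DNS="${VM_IP}" \
      -p 8888:8888 \
      -p 4040:4040 -p 4041:4041 \
      myrepo/spark-notebook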
With these settings the three components are able to communicate with each other. For the moment I'm using custom startup scripts with spark-class to launch the classes in the foreground and keep the Docker containers from quitting.
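As an illustration of what I mean by a foreground startup script, a minimal entrypoint for the master container could look like this (the SPARK_HOME path is an assumption; the worker's script is analogous, invoking org.apache.spark.deploy.worker.Worker with the spark:// URL):

    #!/usr/bin/env bash
    # Minimal foreground entrypoint sketch for the master container.
    set -e
    export SPARK_HOME=${SPARK_HOME:-/opt/spark}

    # spark-class runs the JVM in the foreground (unlike the start-*.sh
    # daemon scripts), so the container keeps running.
    exec "${SPARK_HOME}/bin/spark-class" \
      org.apache.spark.deploy.master.Master \
      --host dockerhost-master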
There are a few other ports that may need to be exposed, such as the history server's, but I haven't dealt with those yet. Using --net host seems much simpler.
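For comparison, a host-networking variant drops the hostname and port-mapping juggling entirely, at the cost of sharing the host's network namespace (image name again a placeholder):

    # No -h, -p or --add-host needed: the container uses the host's interfaces directly.
    docker run -d --name spark-master --net host myrepo/spark-master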