So far I have run Spark only on Linux machines and VMs (bridged networking), but now I am interested in utilizing more computers as slaves. It would be handy to distribute a ready-made Docker container to them.
I'm running 3 different types of Docker containers on my machine, with the intention of deploying them into the cloud once all the software we need has been added to them: Master, Worker, and Jupyter notebook (with Scala, R and Python kernels).
Here are my observations so far:
Master:
- I pass -h "dockerhost-master" -e SPARK_MASTER_IP="dockerhost-master". I couldn't find a way to make Akka bind against the container's IP but accept messages against the host IP. I know it's possible with Akka 2.4, but maybe not with Spark.
- I also pass -e SPARK_LOCAL_IP="${HOST_IP}", which causes the Web UI to bind against that address instead of the container's IP, but the Web UI works all right either way. (A combined launch sketch follows this list.)
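Putting the master flags together, a launch might look roughly like the sketch below. The image name, the spark-class entry point inside it, and the published ports 7077/8080 (the standalone master's usual RPC and web UI ports) are my assumptions rather than part of the setup described above:

# Sketch only: image name, entry point and HOST_IP are placeholders.
HOST_IP=192.168.1.10   # routable address of the Docker host

docker run \
  -h "dockerhost-master" \
  -e SPARK_MASTER_IP="dockerhost-master" \
  -e SPARK_LOCAL_IP="${HOST_IP}" \
  -p 7077:7077 \
  -p 8080:8080 \
  my-spark-image \
  spark-class org.apache.spark.deploy.master.Master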
Worker:
- I pass --host to the Spark org.apache.spark.deploy.worker.Worker class. It can't be the same as the master's or the Akka cluster will not work: -h "dockerhost-worker"
- I use --add-host so the container is able to resolve the hostname to the master's IP: --add-host dockerhost-master:${HOST_IP}
- The worker connects to the master at spark://dockerhost-master:7077 (see the sketch after this list).
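By way of illustration, the worker could then be launched along these lines; the image name and spark-class entry point are again assumptions, and 8081 is published only because it is the worker web UI's usual default:

# Sketch only: HOST_IP is the address of the machine hosting the master container.
HOST_IP=192.168.1.10

docker run \
  -h "dockerhost-worker" \
  --add-host dockerhost-master:${HOST_IP} \
  -p 8081:8081 \
  my-spark-image \
  spark-class org.apache.spark.deploy.worker.Worker spark://dockerhost-master:7077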
Jupyter:
- Needs the same --add-host so it is able to resolve the master's hostname.
- The SparkContext lives in the notebook, and that's where the web UI of the Spark application is started, not on the master. By default it binds to the internal IP address of the Docker container. To change that I had to pass in -e SPARK_PUBLIC_DNS="${VM_IP}" -p 4040:4040. Subsequent applications from the notebook would be on 4041, 4042, etc. (a sketch follows below).

With these settings the three components are able to communicate with each other. At the moment I'm using custom startup scripts with spark-class to launch the classes in the foreground and keep the Docker containers from quitting.
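For the notebook container, the pieces above combine into something like the following sketch; the image name and the Jupyter port 8888 are my assumptions, and VM_IP is whatever address users reach the machine on:

# Sketch only: image name and 8888 (Jupyter's usual port) are assumptions.
HOST_IP=192.168.1.10   # machine running the master container
VM_IP=192.168.1.20     # address users browse to

docker run \
  --add-host dockerhost-master:${HOST_IP} \
  -e SPARK_PUBLIC_DNS="${VM_IP}" \
  -p 4040:4040 \
  -p 8888:8888 \
  my-jupyter-spark-image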
There are a few other ports that could be exposed, such as the history server's, which I haven't encountered yet. Using --net host seems much simpler.
I think I found a solution for my use case (one Spark container per host OS):
- Use --net host with docker run => the host's eth0 is visible in the container.
- Set SPARK_PUBLIC_DNS and SPARK_LOCAL_IP to the host's IP and ignore docker0's 172.x.x.x address.

Spark can bind to the host's IP and other machines can communicate with it as well; port forwarding takes care of the rest. DNS or any complex configs were not needed. I haven't thoroughly tested this, but so far so good. (A sketch of the resulting command follows the edit note below.)
Edit: Note that these instructions are for Spark 1.x; with Spark 2.x only SPARK_PUBLIC_DNS is required, and I think SPARK_LOCAL_IP is deprecated.
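As a concrete illustration (not tested beyond what is described above), the whole thing collapses to something like this; the image name and entry point are placeholders, and per the edit note the SPARK_LOCAL_IP line should be unnecessary on Spark 2.x:

# Sketch only: with --net host no -p or --add-host is needed; image/entry point are placeholders.
HOST_IP=192.168.1.10   # the host's own address, not docker0's 172.x.x.x

docker run --net host \
  -e SPARK_PUBLIC_DNS="${HOST_IP}" \
  -e SPARK_LOCAL_IP="${HOST_IP}" \
  my-spark-image \
  spark-class org.apache.spark.deploy.master.Master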
I am also running Spark in containers on different Docker hosts. Starting the worker container with these arguments worked for me:
docker run \
  -e SPARK_WORKER_PORT=6066 \
  -p 6066:6066 \
  -p 8081:8081 \
  --hostname $PUBLIC_HOSTNAME \
  -e SPARK_LOCAL_HOSTNAME=$PUBLIC_HOSTNAME \
  -e SPARK_IDENT_STRING=$PUBLIC_HOSTNAME \
  -e SPARK_PUBLIC_DNS=$PUBLIC_IP \
  spark ...
where $PUBLIC_HOSTNAME is a hostname reachable from the master.

The missing piece was SPARK_LOCAL_HOSTNAME, an undocumented option AFAICT:
https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L904
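For what it's worth, I would expect the master on its own Docker host to be started analogously, publishing the standalone master's usual 7077/8080 ports; this is an untested sketch reusing the same placeholder image and variables:

# Untested sketch: master counterpart to the worker command above.
docker run \
  -p 7077:7077 \
  -p 8080:8080 \
  --hostname $PUBLIC_HOSTNAME \
  -e SPARK_LOCAL_HOSTNAME=$PUBLIC_HOSTNAME \
  -e SPARK_PUBLIC_DNS=$PUBLIC_IP \
  spark ...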