Spark standalone connection driver to worker

Submitted by 十年热恋 on 2019-12-29 07:21:18

Question


I'm trying to host a Spark standalone cluster locally. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below runs in Docker. I have the following configuration:

  • master on machine 1 (port 7077 exposed)
  • worker on machine 1
  • driver on machine 2

I use a test application that opens a file and counts its lines. The application works when the file is replicated on all workers and I use SparkContext.textFile().

But when the file is only present on the worker and I use SparkContext.parallelize() to access it, I see the following output:

```
INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores
```

This repeats over and over without the application ever actually computing anything.

This works when I put the driver on the same PC as the worker. So I guess there is some kind of connection between the two that needs to be permitted across the network. Are you aware of a way to do that (which ports to open, which addresses to add to /etc/hosts, ...)?


Answer 1:


TL;DR Make sure that spark.driver.host:spark.driver.port can be accessed from each node in the cluster.

In general, you have to ensure that all nodes (both executors and the master) can reach the driver.

  • In cluster mode, where the driver runs on one of the executors, this is satisfied by default, as long as no ports are closed to those connections (see below).
  • In client mode, the machine on which the driver has been started has to be accessible from the cluster. This means that spark.driver.host has to resolve to a publicly reachable address.

In both cases, keep in mind that by default the driver runs on a random port. It is possible to use a fixed one by setting spark.driver.port. Obviously this doesn't work well if you want to submit multiple applications at the same time.
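For example, these two settings can be pinned on the driver machine (machine 2 in the question). This is only a sketch: the address and port below are placeholders, not values taken from the question.

```
# conf/spark-defaults.conf on the driver machine (placeholder values)
spark.driver.host   192.168.1.20   # machine 2's LAN address
spark.driver.port   40000          # fixed port the cluster must be able to reach
```

The same settings can also be passed per application with --conf on spark-submit instead of being set globally.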

Furthermore:

when the file is only present on worker

won't work. All inputs have to be accessible from the driver as well as from each executor node.
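Since every component in the question runs in Docker, the driver's ports also have to be published on the host, and the driver has to bind inside the container while advertising the host's LAN address. A minimal sketch, assuming placeholder ports 40000/40001 and a placeholder image name:

```
# Start the driver container on machine 2, publishing fixed ports (placeholders)
docker run -p 40000:40000 -p 40001:40001 my-driver-image

# Matching Spark configuration for the driver:
spark.driver.bindAddress       0.0.0.0        # bind inside the container
spark.driver.host              192.168.1.20   # machine 2's LAN address (placeholder)
spark.driver.port              40000
spark.driver.blockManager.port 40001
```

Without spark.driver.bindAddress, the driver binds to the container's internal address, and executors on the other machine cannot connect back to it, which produces exactly the RUNNING/EXITED loop shown in the logs.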



Source: https://stackoverflow.com/questions/48289881/spark-standalone-connection-driver-to-worker
