Running Spark driver program in Docker container - no connection back from executor to the driver?


Question


UPDATE: The problem is resolved. The Docker image is here: docker-spark-submit

I run spark-submit with a fat jar inside a Docker container. My standalone Spark cluster runs on 3 virtual machines - one master and two workers. From an executor log on a worker machine, I see that the executor has the following driver URL:

"--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:5001"

172.17.0.2 is actually the address of the container running the driver program, not the address of the host machine the container runs on. This IP is not reachable from the worker machines, so the workers cannot communicate with the driver program. As far as I can see from the source code of StandaloneSchedulerBackend, it builds driverUrl from the spark.driver.host setting:

val driverUrl = RpcEndpointAddress(
  sc.conf.get("spark.driver.host"),
  sc.conf.get("spark.driver.port").toInt,
  CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

It does not take the SPARK_PUBLIC_DNS environment variable into account - is this correct? Inside the container, I cannot set spark.driver.host to anything other than the container's "internal" IP address (172.17.0.2 in this example). When I try to set spark.driver.host to the IP address of the host machine, I get errors like this:

WARN Utils: Service 'sparkDriver' could not bind on port 5001. Attempting port 5002.

I also tried setting spark.driver.bindAddress to the IP address of the host machine, but got the same errors. So, how can I configure Spark to communicate with the driver program using the host machine's IP address rather than the Docker container's address?
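
For reference, the kind of spark-submit invocation I run inside the container looks roughly like the sketch below; the master URL, jar path and main class are placeholders. As I understand it, spark.driver.bindAddress defaults to the value of spark.driver.host, so setting spark.driver.host to the host machine's IP makes the driver try to bind an address that does not exist inside the container, which is presumably where the bind warnings above come from.

spark-submit \
  --master spark://spark-master-vm:7077 \
  --deploy-mode client \
  --conf spark.driver.port=5001 \
  --class com.example.MyApp \
  /opt/app/my-app-assembly.jar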

UPDATE: Stack trace from the executor:

ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Failure.recover(Try.scala:216)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
    at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
    ... 8 more

Answer 1:


My setup, with Docker and macOS:

  • Spark 1.6.3 master + worker run inside the same Docker container
  • The Java app (driver) runs on macOS (from an IDE)

docker-compose publishes the following ports:

ports:
- 7077:7077
- 20002:20002
- 6060:6060

Java config (for development purposes):

esSparkConf.setMaster("spark://127.0.0.1:7077");   // standalone master, reachable via the port published by docker-compose
esSparkConf.setAppName("datahub_dev");

// The driver runs on the Mac: bind on all interfaces, but advertise the Mac's LAN IP
// so that the executors inside the container can connect back to it.
esSparkConf.setIfMissing("spark.driver.port", "20002");
esSparkConf.setIfMissing("spark.driver.host", "MAC_OS_LAN_IP");
esSparkConf.setIfMissing("spark.driver.bindAddress", "0.0.0.0");
esSparkConf.setIfMissing("spark.blockManager.port", "6060");
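
For context, a complete docker-compose service around those port mappings might look like the sketch below; the service name and image are placeholders, and only the port mappings themselves come from this answer.

version: "2"
services:
  spark:
    image: my-spark-1.6.3-standalone   # placeholder image containing master + worker
    ports:
      - "7077:7077"     # standalone master, reached from the Mac as spark://127.0.0.1:7077
      - "20002:20002"
      - "6060:6060"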



Answer 2:


So the working configuration is:

  • set spark.driver.host to the IP address of the host machine
  • set spark.driver.bindAddress to the IP address of the container

The working Docker image is here: docker-spark-submit.
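
For illustration, a minimal spark-submit run inside such a container might look like the sketch below. The IP addresses, port, image, jar and class names are placeholders; only the two settings above come from this answer, and publishing the driver port so that the workers can reach it via the host address is an assumption of the sketch.

docker run --rm \
  -p 5001:5001 \
  my-spark-submit-image \
  spark-submit \
    --master spark://spark-master-vm:7077 \
    --conf spark.driver.host=192.168.1.10 \
    --conf spark.driver.bindAddress=172.17.0.2 \
    --conf spark.driver.port=5001 \
    --class com.example.MyApp \
    /opt/app/my-app-assembly.jar

Here 192.168.1.10 stands for the Docker host's IP and 172.17.0.2 for the container's own address: the driver binds the container address, the workers connect to host:5001, and Docker forwards that connection into the container. Depending on the job, spark.driver.blockManager.port may need the same treatment (see the last answer below).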




Answer 3:


I noticed that the other answers use Spark Standalone (either on VMs, as the OP mentions, or on 127.0.0.1, as in the other answer).

I wanted to show what seems to work for me: running a variation of jupyter/pyspark-notebook against a remote AWS Mesos cluster, with the container running locally in Docker on a Mac.

In that case these instructions apply; note, however, that --net=host does not work on anything but a Linux host.
An important step here is to create the notebook user on the Mesos slaves' OS, as mentioned in the link.
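
For example, assuming the container runs as the default jupyter docker-stacks user (jovyan - that name is an assumption here; use whatever user your notebook actually runs as), creating it on each Mesos slave is a one-liner:

sudo useradd -m jovyan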

This diagram was helpful for debugging the networking, but it didn't mention spark.driver.blockManager.port, which was actually the final parameter that got this working and which I had missed in the Spark documentation. Without it, the executors on the Mesos slaves try to bind that block manager port as well, and Mesos refuses to allocate it.

Expose these ports so that you can access Jupyter and the Spark UI locally:

  • Jupyter UI (8888)
  • Spark UI (4040)

And expose these ports so that Mesos can reach back to the driver. Important: bi-directional communication must also be allowed to the Mesos masters, slaves and ZooKeeper...

  • "libprocess" address + port seems to get stored/broadcast in Zookeeper via LIBPROCESS_PORT variable (random:37899). Refer: Mesos documentation
  • Spark driver port (random:33139) + 16 for spark.port.maxRetries
  • Spark block manager port (random:45029) + 16 for spark.port.maxRetries

Not really relevant, but I am using the Jupyter Lab interface.

export EXT_IP=<your external IP>

docker run \
  -p 8888:8888 -p 4040:4040 \
  -p 37899:37899 \
  -p 33139-33155:33139-33155 \
  -p 45029-45045:45029-45045 \
  -e JUPYTER_ENABLE_LAB=y \
  -e EXT_IP \
  -e LIBPROCESS_ADVERTISE_IP=${EXT_IP} \
  -e LIBPROCESS_PORT=37899 \
  jupyter/pyspark-notebook

Once that starts, I go to the localhost:8888 address for Jupyter and just open a terminal for a simple spark-shell session. I could also add a volume mount for actual packaged code, but that is the next step.
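
For example (the local path is a placeholder; /home/jovyan/work is the default work directory in the jupyter docker-stacks images), adding one more flag to the docker run command above would mount locally packaged code into the notebook:

  -v "$PWD/my-project":/home/jovyan/work \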

I didn't edit spark-env.sh or spark-defaults.conf, so I pass all the relevant confs to spark-shell for now. Reminder: this is run inside the container.

spark-shell --master mesos://zk://quorum.in.aws:2181/mesos \
  --conf spark.executor.uri=https://path.to.http.server/spark-2.4.2-bin-hadoop2.7.tgz \
  --conf spark.cores.max=1 \
  --conf spark.executor.memory=1024m \
  --conf spark.driver.host=$LIBPROCESS_ADVERTISE_IP \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=33139 \
  --conf spark.driver.blockManager.port=45029

This loads the Spark REPL. After some output about finding a Mesos master and registering a framework, I then read some files from HDFS using the NameNode IP (although I suspect any other accessible filesystem or database would work).

And I get the expected output:

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.text("hdfs://some.hdfs.namenode:9000/tmp/README.md").show(10)
+--------------------+
|               value|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Spark Streami...|
|                    |
|<http://spark.apa...|
+--------------------+
only showing top 10 rows


Source: https://stackoverflow.com/questions/45489248/running-spark-driver-program-in-docker-container-no-connection-back-from-execu
