Spark SPARK_PUBLIC_DNS and SPARK_LOCAL_IP on stand-alone cluster with docker containers

陌清茗 2020-12-09 06:09

So far I have run Spark only on Linux machines and VMs (bridged networking), but now I am interested in utilizing more computers as slaves. It would be handy to distribute a

3 Answers
  • 2020-12-09 06:21

    I'm running 3 different types of Docker containers on my machine with the intention of deploying them into the cloud once all the software we need is added to them: Master, Worker and Jupyter notebook (with Scala, R and Python kernels).

    Here are my observations so far:

    Master:

    • I couldn't make it bind to the Docker Host IP. Instead, I pass in a made-up domain name: -h "dockerhost-master" -e SPARK_MASTER_IP="dockerhost-master". I couldn't find a way to make Akka bind against the container's IP but accept messages against the host IP. I know it's possible with Akka 2.4, but maybe not with Spark.
    • I'm passing in -e SPARK_LOCAL_IP="${HOST_IP}" which causes the Web UI to bind against that address instead of the container's IP, but the Web UI works all right either way (a launch sketch follows this list).
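
    A minimal launch sketch along these lines; the image name spark-master-image and the published ports 7077/8080 are placeholders I'm assuming, not part of the original setup:

    # Sketch only: image name and published ports are assumptions.
    docker run -d \
      -h "dockerhost-master" \
      -e SPARK_MASTER_IP="dockerhost-master" \
      -e SPARK_LOCAL_IP="${HOST_IP}" \
      -p 7077:7077 \
      -p 8080:8080 \
      spark-master-image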

    Worker:

    • I gave the worker container a different hostname and pass it as --host to the Spark org.apache.spark.deploy.worker.Worker class. It can't be the same as the master's or the Akka cluster will not work: -h "dockerhost-worker"
    • I'm using Docker's add-host so the container is able to resolve the hostname to the master's IP: --add-host dockerhost-master:${HOST_IP}
    • The master URL that needs to be passed is spark://dockerhost-master:7077 (a launch sketch follows this list)
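
    A matching sketch for the worker, assuming the image (spark-worker-image here) has an entrypoint that forwards the trailing argument as the master URL to the Worker class:

    # Sketch only: image name, entrypoint behaviour and UI port mapping are assumptions.
    docker run -d \
      -h "dockerhost-worker" \
      --add-host dockerhost-master:${HOST_IP} \
      -p 8081:8081 \
      spark-worker-image \
      spark://dockerhost-master:7077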

    Jupyter:

    • This one needs the master URL and --add-host so it can resolve the master's hostname
    • The SparkContext lives in the notebook and that's where the web UI of the Spark application is started, not on the master. By default it binds to the internal IP address of the Docker container. To change that I had to pass in: -e SPARK_PUBLIC_DNS="${VM_IP}" -p 4040:4040. Subsequent applications from the notebook end up on 4041, 4042, etc. (a launch sketch follows this list)
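
    A sketch of the notebook container under the same assumptions; the image name jupyter-spark-image and the Jupyter port 8888 are placeholders:

    # Sketch only: image name and notebook port are assumptions.
    docker run -d \
      --add-host dockerhost-master:${HOST_IP} \
      -e SPARK_PUBLIC_DNS="${VM_IP}" \
      -p 8888:8888 \
      -p 4040:4040 \
      jupyter-spark-image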

    With these settings the three components are able to communicate with each other. For now I'm using custom startup scripts with spark-class to launch the classes in the foreground, which keeps the Docker containers from exiting.
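
    As an illustration, such a startup script for the master could be a sketch like this (paths and ports here are assumptions):

    #!/usr/bin/env bash
    # Run the Master class in the foreground with spark-class; exec keeps it
    # as PID 1 so the container stays up as long as the master runs.
    exec "${SPARK_HOME}/bin/spark-class" org.apache.spark.deploy.master.Master \
      --host "dockerhost-master" \
      --port 7077 \
      --webui-port 8080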

    There are a few other ports that may need to be exposed, such as the history server's, which I haven't run into yet. Using --net host seems much simpler.

  • 2020-12-09 06:33

    I think I found a solution for my use-case (one Spark container / host OS):

    1. Use --net host with docker run => host's eth0 is visible in the container
    2. Set SPARK_PUBLIC_DNS and SPARK_LOCAL_IP to the host's IP; ignore docker0's 172.x.x.x address

    Spark can bind to the host's IP and other machines can communicate with it as well; port forwarding takes care of the rest. No DNS or other complex configuration was needed. I haven't thoroughly tested this, but so far so good.
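
    As a sketch, the whole thing boils down to something like this (my-spark-image and HOST_IP are placeholders for the image and the host's LAN address):

    # Sketch only: --net host shares the host's network stack, so Spark binds
    # directly to the host's address and no -p mappings are needed.
    docker run -d \
      --net host \
      -e SPARK_PUBLIC_DNS="${HOST_IP}" \
      -e SPARK_LOCAL_IP="${HOST_IP}" \
      my-spark-image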

    Edit: Note that these instructions are for Spark 1.x; with Spark 2.x only SPARK_PUBLIC_DNS is required, and I think SPARK_LOCAL_IP is deprecated.

  • 2020-12-09 06:34

    I am also running Spark in containers on different Docker hosts. Starting the worker container with these arguments worked for me:

    docker run \
    -e SPARK_WORKER_PORT=6066 \
    -p 6066:6066 \
    -p 8081:8081 \
    --hostname $PUBLIC_HOSTNAME \
    -e SPARK_LOCAL_HOSTNAME=$PUBLIC_HOSTNAME \
    -e SPARK_IDENT_STRING=$PUBLIC_HOSTNAME \
    -e SPARK_PUBLIC_DNS=$PUBLIC_IP \
    spark ...
    

    where $PUBLIC_HOSTNAME is a hostname reachable from the master.

    The missing piece was SPARK_LOCAL_HOSTNAME, an undocumented option AFAICT.

    https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L904
