Spark on YARN concept understanding

死守一世寂寞 2020-11-28 21:51

I am trying to understand how Spark runs on YARN in cluster and client mode. I have the following questions in mind.

  1. Is it necessary that spark is installed on all th

3 answers
  •  孤街浪徒
    2020-11-28 22:25

    Adding to other answers.

    1. Is it necessary that spark is installed on all the nodes in yarn cluster?

    No, not if the Spark job is scheduled on YARN (in either client or cluster mode). Spark needs to be installed on many nodes only in standalone mode.

    These are visualisations of the Spark application deployment modes.

    [Figure: Spark standalone mode]

    In cluster mode the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.
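The deploy mode is chosen when the job is submitted. As a minimal sketch (not the answerer's own code), the helper below only builds the argument list for `spark-submit` with its `--master` and `--deploy-mode` options; it does not run anything. The function name `build_submit_command` and the application file `my_app.py` are hypothetical.

```python
from typing import List

VALID_DEPLOY_MODES = ("client", "cluster")


def build_submit_command(master: str, deploy_mode: str, app: str) -> List[str]:
    """Return the argv for a spark-submit call; nothing is executed here."""
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(f"deploy_mode must be one of {VALID_DEPLOY_MODES}")
    # client  -> the driver runs on the machine that launches the job
    # cluster -> the driver runs inside the cluster
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]


if __name__ == "__main__":
    # my_app.py is a hypothetical application file.
    for mode in VALID_DEPLOY_MODES:
        print(" ".join(build_submit_command("yarn", mode, "my_app.py")))
```

Only the `--deploy-mode` value differs between the two commands; the application itself is unchanged.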


    [Figure: YARN cluster mode]

    [Figure: YARN client mode]

    This table offers a concise list of differences between these modes:

    [Table: differences among Standalone, YARN Cluster and YARN Client modes]

    pics source

    2. The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is only sending the job to the cluster?

    A Hadoop installation is not mandatory, but some of its configuration files are. Such client machines can be called gateway nodes. There are two main reasons.

    • The configuration contained in the HADOOP_CONF_DIR directory is distributed to the YARN cluster so that all containers used by the application use the same configuration.
    • In YARN mode the ResourceManager's address is picked up from the Hadoop configuration (yarn-site.xml, falling back to the defaults in yarn-default.xml). That is why the --master parameter is simply yarn.
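To make the second point concrete, here is a rough sketch of how a client-side script could look up the ResourceManager address itself. It is not how Spark does it internally (Spark relies on Hadoop's own configuration classes). It assumes the address is set in a `yarn-site.xml` file inside the directory named by HADOOP_CONF_DIR or YARN_CONF_DIR, and the order in which the two variables are checked is a choice of this sketch. The helper names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET
from typing import Dict, Mapping, Optional


def load_hadoop_properties(xml_path: str) -> Dict[str, str]:
    """Read <property><name>/<value> pairs from a Hadoop-style XML file."""
    props: Dict[str, str] = {}
    root = ET.parse(xml_path).getroot()
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def find_resource_manager(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured ResourceManager address or hostname, if any."""
    env = os.environ if env is None else env
    # Assumption of this sketch: check HADOOP_CONF_DIR first, then YARN_CONF_DIR.
    for var in ("HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        conf_dir = env.get(var)
        if not conf_dir:
            continue
        site_file = os.path.join(conf_dir, "yarn-site.xml")
        if not os.path.isfile(site_file):
            continue
        props = load_hadoop_properties(site_file)
        # Prefer the explicit address; otherwise fall back to the hostname.
        for key in ("yarn.resourcemanager.address", "yarn.resourcemanager.hostname"):
            if props.get(key):
                return props[key]
    return None


if __name__ == "__main__":
    print(find_resource_manager())
```

If neither variable is set, or no `yarn-site.xml` is found, the function returns `None`, which mirrors why a gateway node needs the configuration files even without a full Hadoop installation.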


    Update: (2017-01-04)

    Spark 2.0+ no longer requires a fat assembly jar for production deployment. source
