Setting YARN shuffle for Spark makes spark-shell not start


Question


I have a cluster of 4 Ubuntu 14.04 machines where I am setting up Spark 2.1.0 (prebuilt for Hadoop 2.7) to run on top of Hadoop 2.7.3, and I am configuring it to work with YARN. Running jps on each node I get:

  1. node-1
    • 22546 Master
    • 22260 ResourceManager
    • 22916 Jps
    • 21829 NameNode
    • 22091 SecondaryNameNode
  2. node-2
    • 12321 Worker
    • 12485 Jps
    • 11978 DataNode
  3. node-3
    • 15938 Jps
    • 15764 Worker
    • 15431 DataNode
  4. node-4
    • 12251 Jps
    • 12075 Worker
    • 11742 DataNode

Without the YARN shuffle configuration,

./bin/spark-shell --master yarn --deploy-mode client

starts just fine when invoked on node-1.

In order to configure an External Shuffle Service, I read this: http://spark.apache.org/docs/2.1.0/running-on-yarn.html#configuring-the-external-shuffle-service and what I have done is:

Added the following properties to yarn-site.xml:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
    <name>yarn.application.classpath</name>
    <value>/usr/local/spark/spark-2.1.0-bin-hadoop2.7/yarn/spark-2.1.0-yarn-shuffle.jar</value>
</property>

I do have other properties in this file. Leaving these 3 properties out, as I said, lets spark-shell --master yarn --deploy-mode client start normally.
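For reference, a minimal sketch of how these properties are commonly laid out when following that doc page; it assumes the cluster also runs MapReduce jobs (and so keeps the mapreduce_shuffle handler) and that the shuffle jar sits on each NodeManager's classpath, so the exact values are site-specific:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <!-- keep the MapReduce shuffle handler and add Spark's shuffle service -->
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<!-- spark-<version>-yarn-shuffle.jar must also be on the NodeManager classpath,
     and the NodeManagers restarted after changing these properties -->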

My spark-defaults.conf is:

spark.master                           spark://singapura:7077
spark.executor.memory                  4g
spark.driver.memory                    2g
spark.eventLog.enabled                 true
spark.eventLog.dir                     hdfs://singapura:8020/spark/logs
spark.history.fs.logDirectory          hdfs://singapura:8020/spark/logs
spark.history.provider                 org.apache.spark.deploy.history.FsHistoryProvider
spark.serializer                       org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.scheduler.mode                   FAIR
spark.yarn.stagingDir                  hdfs://singapura:8020/spark
spark.yarn.jars=hdfs://singapura:8020/spark/jars/*.jar
spark.yarn.am.memory                   2g
spark.yarn.am.cores                    4

All nodes have the same paths. singapura is my node-1. It is already set in my /etc/hosts and nslookup returns the correct IP, so the machine name is not the issue here.
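As a sanity check (assuming the default Hadoop log location and naming, and the default shuffle service port 7337), one way to confirm whether the shuffle service actually registered on a worker is to grep the NodeManager log and look at the listening ports; the exact log message wording varies between versions:

# run on each worker node (node-2 .. node-4)
grep -iE "YarnShuffleService|spark_shuffle" $HADOOP_HOME/logs/yarn-*-nodemanager-*.log
# spark.shuffle.service.port defaults to 7337
ss -ltn | grep 7337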

So, what happens is this: when I add these 3 properties to my yarn-site.xml and start spark-shell, it gets stuck without much output.

localuser@singapura:~$ /usr/local/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-shell --master yarn --deploy-mode client
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

I wait and wait and nothing more is printed out. I have to kill it and erase the staging directory (if I don't erase it, I get WARN yarn.Client: Failed to cleanup staging dir the next time I call it).
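To get more detail on a hang like this, one option (sketched here with a placeholder application id) is to ask YARN what state the application is in and then pull its logs; yarn logs only returns container logs once the application has finished and log aggregation (yarn.log-aggregation-enable) is on:

# is the spark-shell application stuck in ACCEPTED (i.e. the AM container never started)?
yarn application -list -appStates ACCEPTED,RUNNING

# after killing the shell, fetch its aggregated logs (replace the placeholder id)
yarn logs -applicationId <application_id>

# the NodeManager logs on the workers usually show why a container or aux service failed to start
tail -n 100 $HADOOP_HOME/logs/yarn-*-nodemanager-*.log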

Source: https://stackoverflow.com/questions/42987772/setting-yarn-shuffle-for-spark-makes-spark-shell-not-start
