Property spark.yarn.jars - how to deal with it?

遥遥无期 2020-12-12 23:12

My knowledge of Spark is limited, and you will sense it after reading this question. I have just one node, and Spark, Hadoop and YARN are all installed on it.

I was abl

3 Answers
  • 2020-12-12 23:23

    If you look at the spark.yarn.jars documentation, it says the following:

    List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

    This means that you are overriding SPARK_HOME/jars and telling YARN to pick up all the jars required to run the application from the path you specify. If you set the spark.yarn.jars property, every jar that Spark depends on at runtime must be present in that path. For example, if you look inside the spark-assembly.jar present in SPARK_HOME/lib, you will find the org.apache.spark.deploy.yarn.ApplicationMaster class, so make sure that all the Spark dependencies are present in the HDFS path you specify as spark.yarn.jars.
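    As a minimal sketch of what that looks like in practice (the HDFS directory below is a placeholder, not a path from this answer), you could upload the local Spark jars once and point the property at them with a glob:

    # copy every jar from the local Spark install into a world-readable HDFS directory
    hdfs dfs -mkdir -p /user/spark/share/lib
    hdfs dfs -put $SPARK_HOME/jars/*.jar /user/spark/share/lib/

    # then reference the uploaded jars in conf/spark-defaults.conf
    spark.yarn.jars hdfs:///user/spark/share/lib/*.jar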

  • 2020-12-12 23:28

    I was finally able to make sense of this property. I found by trial and error that the correct syntax of this property is

    spark.yarn.jars=hdfs://xx:9000/user/spark/share/lib/*.jar

    Initially I didn't put *.jar at the end, and my path just ended with /lib. I also tried pointing at an actual assembly jar like this - spark.yarn.jars=hdfs://sanjeevd.brickred:9000/user/spark/share/lib/spark-yarn_2.11-2.0.1.jar - but no luck; all it said was that it was unable to load the ApplicationMaster.
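    As a hedged aside (reusing the placeholder host and port from above; the application class and jar are made up for illustration), the same property can also be passed per job on the spark-submit command line instead of in spark-defaults.conf:

    spark-submit --master yarn \
      --conf spark.yarn.jars="hdfs://xx:9000/user/spark/share/lib/*.jar" \
      --class com.example.MyApp myapp.jar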

    I posted my response to a similar question at https://stackoverflow.com/a/41179608/2332121

  • 2020-12-12 23:31

    You could also use the spark.yarn.archive option and set it to the location of an archive (which you create) containing all the JARs from the $SPARK_HOME/jars/ folder, placed at the root level of the archive. For example:

    1. Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
    2. Upload it to HDFS: hdfs dfs -put spark-libs.jar /some/path/.
      2a. For a large cluster, increase the replication count of the Spark archive so that you reduce the number of times a NodeManager has to do a remote copy: hdfs dfs -setrep -w 10 hdfs:///some/path/spark-libs.jar (scale the number of replicas in proportion to the total number of NodeManagers).
    3. Set spark.yarn.archive to hdfs:///some/path/spark-libs.jar (see the sketch below).
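    As a minimal sketch tying these steps together (assuming the /some/path location used above; the application class and jar are made up for illustration), the property can live in conf/spark-defaults.conf or be passed per job:

    # conf/spark-defaults.conf
    spark.yarn.archive hdfs:///some/path/spark-libs.jar

    # or, equivalently, per application
    spark-submit --master yarn \
      --conf spark.yarn.archive=hdfs:///some/path/spark-libs.jar \
      --class com.example.MyApp myapp.jar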