Trying to run Spark on EMR using the AWS SDK for Java, but it skips the remote JAR stored on S3


Question


I'm trying to run Spark on EMR using the AWS SDK for Java, but I'm having trouble getting spark-submit to use a JAR that I have stored on S3. Here is the relevant code:

public String launchCluster() throws Exception {
    StepFactory stepFactory = new StepFactory();

    // Creates a cluster flow step for debugging
    StepConfig enableDebugging = new StepConfig().withName("Enable debugging")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    // Here is the original code before I tried command-runner.jar.
    // When using this, I get a ClassNotFoundException for
    // org.apache.spark.SparkConf. This is because, for some reason,
    // the super-JAR that I'm generating doesn't include Apache Spark.
    // Even so, I believe EMR should already have Spark installed if
    // I configure this correctly...

    //        HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
    //                .withJar(JAR_LOCATION)
    //                .withMainClass(MAIN_CLASS);

    HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
            .withJar("command-runner.jar")
            .withArgs(
                    "spark-submit",
                    "--master", "yarn",
                    "--deploy-mode", "cluster",
                    "--class", SOME_MAIN_CLASS,
                    SOME_S3_PATH_TO_SUPERJAR,
                    "-useSparkLocal", "false"
            );

    StepConfig customExampleStep = new StepConfig().withName("Example Step")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(runExampleConfig);

    // Create Applications so that the request knows to launch
    // the cluster with support for Hadoop and Spark.

    // Unsure if Hadoop is necessary...
    Application hadoopApp = new Application().withName("Hadoop");
    Application sparkApp = new Application().withName("Spark");

    RunJobFlowRequest request = new RunJobFlowRequest().withName("spark-cluster")
            .withReleaseLabel("emr-5.15.0")
            .withSteps(enableDebugging, customExampleStep)
            .withApplications(hadoopApp, sparkApp)
            .withLogUri(LOG_URI)
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withVisibleToAllUsers(true)
            .withInstances(new JobFlowInstancesConfig()
                    .withInstanceCount(3)
                    .withKeepJobFlowAliveWhenNoSteps(true)
                    .withMasterInstanceType("m3.xlarge")
                    .withSlaveInstanceType("m3.xlarge")
            );
    // Build an EMR client and submit the request so that the job flow ID
    // can be returned (region and credentials come from the default provider chain).
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard().build();
    RunJobFlowResult result = emr.runJobFlow(request);
    return result.getJobFlowId();
}

The steps complete without error, but the job doesn't actually output anything. When I check the logs, stderr includes the following:
Warning: Skip remote jar s3://somebucket/myservice-1.0-super.jar.
and
18/07/17 22:08:31 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

I'm not sure what the issue is based on the logs. I believe I am installing Spark on the cluster correctly. Also, to give some context: when I use withJar directly with the super-JAR stored on S3 instead of command-runner.jar (and without withArgs), the step correctly grabs the JAR, but then Spark isn't available: I get a ClassNotFoundException for SparkConf (or JavaSparkContext, depending on which class my Spark job code tries to create first).
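
For reference, that direct-JAR variant looked roughly like this (a sketch reusing the JAR_LOCATION and MAIN_CLASS constants from the commented-out code above; the step name is just illustrative):

HadoopJarStepConfig directJarConfig = new HadoopJarStepConfig()
        .withJar(JAR_LOCATION)      // s3://... path to the super-JAR
        .withMainClass(MAIN_CLASS);

StepConfig directJarStep = new StepConfig().withName("Direct JAR Step")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(directJarConfig);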

Any pointers would be much appreciated!


Answer 1:


I think that if you are using a recent EMR release (emr-5.17.0, for instance), the --master parameter should be yarn-cluster instead of yarn in the runExampleConfig statement. I had the same problem, and after this change it worked fine for me.
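
Applied to the step config from the question, the suggested change would look roughly like this (a sketch, not a tested config; only the --master value differs):

HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs(
                "spark-submit",
                "--master", "yarn-cluster",  // changed from "yarn"
                "--deploy-mode", "cluster",
                "--class", SOME_MAIN_CLASS,
                SOME_S3_PATH_TO_SUPERJAR,
                "-useSparkLocal", "false"
        );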



Source: https://stackoverflow.com/questions/51391911/trying-to-run-spark-on-emr-using-the-aws-sdk-for-java-but-it-skips-the-remote-j
