Question
I'm trying to run Spark on EMR using the AWS SDK for Java, but I'm having trouble getting spark-submit to use a JAR that I have stored on S3. Here is the relevant code:
public String launchCluster() throws Exception {
    StepFactory stepFactory = new StepFactory();

    // Creates a cluster flow step for debugging
    StepConfig enableDebugging = new StepConfig().withName("Enable debugging")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    // Here is the original code before I tried command-runner.jar.
    // When using this, I get a ClassNotFoundException for
    // org.apache.spark.SparkConf. This is because, for some reason, the
    // super-jar that I'm generating doesn't include Apache Spark.
    // Even so, I believe EMR should already have Spark installed if
    // I configure this correctly...
    // HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
    //         .withJar(JAR_LOCATION)
    //         .withMainClass(MAIN_CLASS);

    HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
            .withJar("command-runner.jar")
            .withArgs(
                    "spark-submit",
                    "--master", "yarn",
                    "--deploy-mode", "cluster",
                    "--class", SOME_MAIN_CLASS,
                    SOME_S3_PATH_TO_SUPERJAR,
                    "-useSparkLocal", "false"
            );

    StepConfig customExampleStep = new StepConfig().withName("Example Step")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(runExampleConfig);

    // Create Applications so that the request knows to launch
    // the cluster with support for Hadoop and Spark.
    // Unsure if Hadoop is necessary...
    Application hadoopApp = new Application().withName("Hadoop");
    Application sparkApp = new Application().withName("Spark");

    RunJobFlowRequest request = new RunJobFlowRequest().withName("spark-cluster")
            .withReleaseLabel("emr-5.15.0")
            .withSteps(enableDebugging, customExampleStep)
            .withApplications(hadoopApp, sparkApp)
            .withLogUri(LOG_URI)
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withVisibleToAllUsers(true)
            .withInstances(new JobFlowInstancesConfig()
                    .withInstanceCount(3)
                    .withKeepJobFlowAliveWhenNoSteps(true)
                    .withMasterInstanceType("m3.xlarge")
                    .withSlaveInstanceType("m3.xlarge")
            );

    // This line was missing from the snippet as posted; `emr` is assumed to be
    // an AmazonElasticMapReduce client configured elsewhere.
    RunJobFlowResult result = emr.runJobFlow(request);
    return result.getJobFlowId();
}
The steps complete without error, but it doesn't actually output anything. When I check the logs, stderr includes the following:

Warning: Skip remote jar s3://somebucket/myservice-1.0-super.jar.

and

18/07/17 22:08:31 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
I'm not sure what the issue is based on the log. I believe I am installing Spark correctly on the cluster. Also, to give some context: when I use withJar directly with the super-JAR stored on S3 instead of command-runner.jar (and without withArgs), it correctly grabs the JAR, but then Spark isn't available and I get a ClassNotFoundException for SparkConf (and JavaSparkContext, depending on which class my Spark job code tries to create first).
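To illustrate, the job's entry point looks roughly like this (a simplified sketch, not my actual code; the real class is whatever SOME_MAIN_CLASS points to). Constructing SparkConf is the first thing that touches Spark, which is why that is the class named in the ClassNotFoundException:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExampleSparkJob {
    public static void main(String[] args) {
        // If Spark is not on the classpath (e.g. when the step runs the
        // super-jar directly instead of going through spark-submit),
        // loading SparkConf here is what fails first.
        SparkConf conf = new SparkConf().setAppName("example-job");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // Trivial action just to exercise the context.
            sc.parallelize(java.util.Arrays.asList(1, 2, 3)).count();
        } finally {
            sc.stop();
        }
    }
}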
Any pointers would be much appreciated!
Answer 1:
I think that if you are using a recent EMR release (emr-5.17.0, for instance), the --master parameter should be yarn-cluster instead of yarn in the runExampleConfig step.
I had the same problem and, after this change, it works fine for me.
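For clarity, this is the change I mean in the step configuration (a sketch using the same placeholder constants as in your question; since yarn-cluster already implies cluster deploy mode, the separate --deploy-mode argument is dropped here):

HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs(
                "spark-submit",
                "--master", "yarn-cluster",
                "--class", SOME_MAIN_CLASS,
                SOME_S3_PATH_TO_SUPERJAR,
                "-useSparkLocal", "false"
        );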
Source: https://stackoverflow.com/questions/51391911/trying-to-run-spark-on-emr-using-the-aws-sdk-for-java-but-it-skips-the-remote-j