How to use -libjars on AWS EMR?

佐手、 posted on 2019-12-01 10:36:13

Try building a fat JAR instead: bundle your job classes and their dependencies into a single JAR, then run that JAR on EMR. That should work.

In an Ant build, you can do this as follows:

<zip destfile="/lib/abc-fatjar.jar">
  <zipgroupfileset dir="lib" includes="jobcustomjar.jar,json-simple-1.1.1.jar" />
</zip>
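
One way to check that the dependency really ended up in the fat JAR is to list its entries, since a JAR is just a ZIP archive. The sketch below is only an illustration: jar_has_package is a hypothetical helper name, and it assumes the json-simple classes live under the org/json/simple/ package directory.

```python
import zipfile


def jar_has_package(jar_path, package_prefix):
    """Return True if any entry in the JAR starts with package_prefix.

    jar_path may be a file path or a file-like object; package_prefix uses
    forward slashes, e.g. "org/json/simple/".
    """
    # A JAR is a ZIP archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(jar_path) as jar:
        return any(name.startswith(package_prefix) for name in jar.namelist())
```

For example, after the build above, jar_has_package("/lib/abc-fatjar.jar", "org/json/simple/") would be expected to return True if the merge worked.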

For a Hadoop Streaming job, where you can't bundle your code into one big JAR, you can use the following trick. (In my case, I wrote my own Java classes for custom input and output formats; the same trick applies to custom splitters or anything else.)

  1. Create a Jar containing your custom classes

  2. Upload the Jar to S3:

    aws s3 cp myjar.jar s3://mybucket/myjar.jar
    
  3. Create a shell script that fetches the Jar and copies it to the Master node:

    #!/bin/bash
    hadoop fs -copyToLocal s3://mybucket/myjar.jar /home/hadoop/myjar.jar
    
  4. Upload the shell script to S3:

    aws s3 cp jar_fetcher.sh s3://mybucket/jar_fetcher.sh
    
  5. When creating your EMR job, run your jar-fetcher script before your streaming job:

    elastic-mapreduce --create \
      --ami-version "3.3.1" \
      --name "My EMR Job using -libjars" \
      --num-instances 3 \
      --master-instance-type "m3.xlarge"  --slave-instance-type "m3.xlarge" \
      --script s3://mybucket/jar_fetcher.sh \
        --step-name "Jar fetcher for -libjars" \
      --stream \
        --args "-libjars,/home/hadoop/myjar.jar" \
        --args "-D,org.apache.hadoop.mapreduce.lib.input.FileInputFormat=my.custom.InputFormat" \
        --args "-outputformat,my.custom.OutputFormat" \
        --arg "-files" \
        --arg "s3://mybucket/some_job.py,s3://mybucket/some_utils.py" \
        --mapper "python some_job.py --someArg" \
        --reducer NONE \
        --input s3://mybucket/someData \
        --output s3://mybucket/results/someJob \
        --step-name "Streaming step with -libjars"
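
Note that the command above passes -libjars through --args but passes -files through two separate --arg flags. A likely reason, assuming --args splits its value on commas while --arg passes one argument through unchanged, is that the -files value itself contains a comma. The following Python sketch illustrates that rule; emr_cli_arg_flags is a hypothetical helper, not part of any EMR tool.

```python
def emr_cli_arg_flags(option, value):
    """Return elastic-mapreduce flags that pass `option value` to a step.

    Assumption: --args splits on commas and --arg does not, as the
    command above suggests.
    """
    if "," in value:
        # The value is itself a comma-separated list (e.g. several files
        # or jars), so keep it intact by using one --arg per token.
        return ["--arg", option, "--arg", value]
    # No embedded comma: option and value can share a single --args flag.
    return ["--args", f"{option},{value}"]
```

Under that assumption, adding a second JAR to -libjars would mean switching that option to the two-flag --arg form as well.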
    

Building on Sandesh's answer, here is the Ant build file I used to build the JAR. Run it with ant build-jar:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project name="alert">
  <target name="build-jar">
    <jar destfile="lib/fatjar.jar" basedir="classes">
      <manifest>
        <attribute name="Main-Class" value="alert.Alert"/>
      </manifest>
      <zipgroupfileset dir="." includes="json-simple-1.1.1.jar" />
    </jar>
  </target>
</project>

Then, after specifying the path to fatjar.jar in EMR, I used the following arguments:

-D mapred.output.compress=true -D mapred.output.compression.type=BLOCK -D io.seqfile.compression.type=BLOCK -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec s3n://akshayhazari/rule/rule1.json s3n://akshayhazari/Alert/input/data.txt.gz s3n://akshayhazari/Alert/input/data1.txt.gz s3n://akshayhazari/Alert/output
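
The argument line above puts every -D property before the input and output paths. Hadoop's generic option parsing stops at the first non-option argument, so that ordering matters, assuming the job's main class handles generic options at all (for example via ToolRunner; the answer does not say). A small Python sketch for assembling such a line, with build_job_args as a hypothetical helper name:

```python
def build_job_args(properties, paths):
    """Return job arguments with all -D properties ahead of the paths."""
    args = []
    for key, value in properties.items():
        # Each property becomes two tokens: "-D" and "key=value".
        args += ["-D", f"{key}={value}"]
    # Positional arguments (inputs, then output) go last.
    return args + list(paths)


# Values taken from the argument line above.
properties = {
    "mapred.output.compress": "true",
    "mapred.output.compression.type": "BLOCK",
    "io.seqfile.compression.type": "BLOCK",
    "mapred.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
}
paths = [
    "s3n://akshayhazari/rule/rule1.json",
    "s3n://akshayhazari/Alert/input/data.txt.gz",
    "s3n://akshayhazari/Alert/input/data1.txt.gz",
    "s3n://akshayhazari/Alert/output",
]

print(" ".join(build_job_args(properties, paths)))
```

Keeping the properties in a dictionary makes it harder to accidentally place a -D option after a path, where it would be treated as an ordinary argument.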