How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

终归单人心 2021-02-06 08:42

I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using custo

2 Answers
  • 2021-02-06 09:18

    I believe if you want to set these on a per-job basis, then you need to

    A) for custom Jars, pass them into your jar as arguments, and process them yourself. I believe this can be automated as follows:

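    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Applies generic options such as -D key=value to conf and
      // returns only the application-specific arguments.
      args = new GenericOptionsParser(conf, args).getRemainingArgs();
      //....
    }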
    

    Then create the job in this manner (I haven't verified that this works, though):

     > elastic-mapreduce --jar s3://mybucket/mycode.jar \
        --args "-D,mapred.reduce.tasks=0" \
        --arg s3://mybucket/input \
        --arg s3://mybucket/output
    

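    The GenericOptionsParser should automatically transfer the -D parameters into Hadoop's job configuration. As far as I know, -jobconf is an option of the streaming jar rather than of GenericOptionsParser, so use -D with a custom jar. More details: http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/util/GenericOptionsParser.html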
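
    To make the argument flow concrete, here is a minimal plain-Java sketch of what is expected to happen to the arguments. It is not Hadoop's implementation: the class GenericOptionsSketch and its methods are hypothetical stand-ins, and the comma splitting of the --args string is an assumption about how the elastic-mapreduce CLI behaves. It only shows -D key=value pairs being separated from the arguments that remain for your own code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the real work is done by Hadoop's GenericOptionsParser.
public class GenericOptionsSketch {

    /** Holds the -D properties and the arguments left over for the application. */
    static final class Parsed {
        final Map<String, String> properties = new LinkedHashMap<>();
        final List<String> remaining = new ArrayList<>();
    }

    /** Splits a comma-separated --args value into separate arguments (assumed CLI behaviour). */
    static String[] splitArgs(String commaSeparated) {
        return commaSeparated.split(",");
    }

    /** Separates "-D key=value" and "-Dkey=value" options from all other arguments. */
    static Parsed parse(String[] args) {
        Parsed parsed = new Parsed();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                parsed.remaining.add(arg);   // not a -D option, leave it for the application
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {                    // pairs without a key or '=' are ignored in this sketch
                parsed.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Arguments as the jar would receive them for the command shown earlier.
        List<String> argv = new ArrayList<>();
        for (String a : splitArgs("-D,mapred.reduce.tasks=0")) {
            argv.add(a);
        }
        argv.add("s3://mybucket/input");
        argv.add("s3://mybucket/output");

        Parsed parsed = parse(argv.toArray(new String[0]));
        System.out.println(parsed.properties); // {mapred.reduce.tasks=0}
        System.out.println(parsed.remaining);  // [s3://mybucket/input, s3://mybucket/output]
    }
}
```

    In the real job, the properties end up in the Configuration object and getRemainingArgs() returns the input and output paths, so main() only has to handle its own arguments.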

    B) for the Hadoop streaming jar, you also just pass the configuration change on the command line:

    > elastic-mapreduce --jobflow j-ABABABABA \
       --stream --jobconf mapred.task.timeout=600000 \
       --mapper s3://mybucket/mymapper.sh \
       --reducer s3://mybucket/myreducer.sh \
       --input s3://mybucket/input \
       --output s3://mybucket/output \
       --jobconf mapred.reduce.tasks=0
    

    More details: https://forums.aws.amazon.com/thread.jspa?threadID=43872 and elastic-mapreduce --help

  • 2021-02-06 09:39

    In the context of Amazon Elastic MapReduce (Amazon EMR), you are looking for Bootstrap Actions:

    Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. [emphasis mine]

    Section Running Custom Bootstrap Actions from the CLI provides a generic usage example:

    $ ./elastic-mapreduce --create --stream --alive \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --output s3n://myawsbucket \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh
    

    In particular, there are separate bootstrap actions to configure Hadoop and Java:

    Hadoop (cluster)

    You can specify Hadoop settings via bootstrap action Configure Hadoop, which allows you to set cluster-wide Hadoop settings, for example:

    $ ./elastic-mapreduce --create \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.task.timeout=0"     
    

    Java (JVM)

    You can specify custom JVM settings via bootstrap action Configure Daemons:

    This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

    The provided example sets the heap size to 2048 and configures the Java namenode option:

    $ ./elastic-mapreduce --create --alive \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
      --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19   
    