How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

Hmmm. I'm not sure how old the example with RunJobFlow is... I'd personally ignore it.

Are you able to run?

localhost$ elastic-mapreduce --describe
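If that fails, the most common reason is missing credentials: the Ruby CLI reads them from a credentials.json next to the elastic-mapreduce script, and that works the same way on Windows (with Ruby installed). The key names below are from memory rather than the docs, so treat them as an assumption and check the README that ships with the CLI:

localhost$ cat credentials.json
{
  "access_id": "<your-aws-access-key-id>",
  "private_key": "<your-aws-secret-access-key>",
  "keypair": "<your-ec2-keypair-name>",
  "key-pair-file": "<path-to-keypair.pem>",
  "log_uri": "s3://<your-log-bucket>/",
  "region": "us-east-1"
}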

Once you can, you should play directly on a cluster to shake out the exact steps you need... It's worth doing this so you don't have to start/stop a cluster a bazillion times.

localhost$ elastic-mapreduce --create --alive --num-instances 1
localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --ssh

cluster$ hadoop jar my.jar -D some=1 -D args=1 blah blah
cluster$ hadoop jar some_other_jar.jar -D foo -D bar
cluster$ ^D

localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --terminate

Then, when you're happy with the steps and you need them to run headless (say, from cron), you can have EMR orchestrate the steps itself (including having the cluster self-terminate at the end):

localhost$ elastic-mapreduce --create --num-instances 1 \
    --jar my_jar.jar --args "-D,some=1,-D,args=1,blah,blah" \
    --jar some_other_jar.jar --args "-D,foo,-D,bar"
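While those steps run (or afterwards, when checking a cron run), you can poll the job flow from your local machine; --create prints the job flow ID, and I believe --describe and --list accept it like this (double-check against elastic-mapreduce --help):

localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --describe
localhost$ elastic-mapreduce --list --active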

I'd only explore the --json stuff if you need more complex steps; it's a bit cryptic and hard to get right the first time...

To run a streaming job on EMR, you first need to create a cluster with a command like:

ruby elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge \
    --slave-instance-type m1.xlarge --num-instances 6 --name "Some Job Cluster" \
    --bootstrap-action s3://<path-to-a-bootstrap-script>

This returns a job flow ID, which looks something like j-ABCD7EF763.
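The --plain-output flag is there so the command prints just the job flow ID, which makes it easy to capture when scripting this (a minimal sketch; I'm assuming the ID is the only thing written to stdout, so verify on your setup):

JOBID=$(ruby elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge \
    --slave-instance-type m1.xlarge --num-instances 6 --name "Some Job Cluster" \
    --bootstrap-action s3://<path-to-a-bootstrap-script>)
echo "$JOBID"    # should print something like j-ABCD7EF763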

Now you can submit your job step with the following command:

ruby elastic-mapreduce -j j-ABCD7EF763 --stream --step-name "my step name" \
    --mapper s3://<some-path>/mapper-script.rb --reducer s3://<some-path>/reducer-script.rb \
    --input s3://<input-path> --output s3://<output-path>

You can also run the job directly (submitting the step when you create the cluster, rather than adding a step to an --alive cluster afterwards); in that case the cluster terminates itself when the job ends.
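As a sketch of that one-shot pattern (reusing the placeholder paths above, and assuming the streaming options are accepted on --create just as they are on a step submission):

ruby elastic-mapreduce --create --name "One-shot streaming job" --num-instances 6 \
    --stream --step-name "my step name" \
    --mapper s3://<some-path>/mapper-script.rb --reducer s3://<some-path>/reducer-script.rb \
    --input s3://<input-path> --output s3://<output-path>

Because the cluster is not created with --alive, it shuts down on its own once the step finishes (or fails).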

Tony Fu

Try using the --json option.

e.g. ./elastic-mapreduce --create --name Multisteps --json wordcount_jobflow.json

You will need to trim your JSON file down to just the Steps (removing everything outside the []). There is a thread discussing that: https://forums.aws.amazon.com/thread.jspa?threadID=35093
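For reference, a trimmed steps file ends up looking roughly like the sketch below; the field names follow the EMR Steps API (Name, ActionOnFailure, HadoopJarStep with Jar and Args), and every path and value here is a placeholder, so check it against the thread above before relying on it:

cat > wordcount_jobflow.json <<'EOF'
[
  {
    "Name": "Word count",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "s3://<path-to-your-jar>",
      "Args": ["s3://<input-path>", "s3://<output-path>"]
    }
  }
]
EOF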
