How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

Posted by 谁都会走 on 2019-12-06 13:53:47

Question


I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation I am following is here: http://aws.amazon.com/articles/3938. I am on a Windows 7 computer.

When I try to run this command, I am shown the help information.

./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json 

Of course, since I am on a Windows machine, I actually type the command below. I am not sure why, but for this particular command there was no Windows version (all the other commands were shown in pairs, one for *nix and one for Windows).

 ruby elastic-mapreduce RunJobFlow my_job.json

My question is: how do we submit/run a job from Windows to Amazon's EMR using the command line interface? I've tried searching online, but I get taken to wild places. Any help is appreciated.

Thanks.


Answer 1:


Hmmm. I'm not sure how old the example with RunJobFlow is... I'd personally ignore it.

Are you able to run this?

localhost$ elastic-mapreduce --describe

Once you can, you should experiment directly on a cluster to shake out the exact steps you need. It's worth doing this so you don't have to start and stop a cluster a bazillion times.

localhost$ elastic-mapreduce --create --alive --num-instances 1
localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --ssh

cluster$ hadoop jar my.jar -D some=1 -D args=1 blah blah
cluster$ hadoop jar some_other_jar.jar -D foo -D bar
cluster$ ^D

localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --terminate

Then, when you're happy with the steps and need them to run headless (say, from cron), you can have EMR orchestrate the steps (including the cluster terminating itself at the end).

localhost$ elastic-mapreduce --create --num-instances 1
localhost$ elastic-mapreduce --jar my_jar.jar --args "-D,some=1,-D,args=1,blah,blah"
localhost$ elastic-mapreduce --jar some_other_jar.jar --args "-D,foo,-D,bar"
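
The --args value in the commands above is just the argument list from the interactive hadoop jar run, joined with commas. A minimal Ruby sketch of that conversion follows (Ruby is already installed if you can run the CLI). The helper name to_emr_args is hypothetical and not part of elastic-mapreduce, and the sketch assumes no single argument contains a comma.

```ruby
# Hypothetical helper, not part of the elastic-mapreduce CLI.
# Joins a hadoop-style argument list into the comma-separated form
# that the --args option is given in the commands above.
def to_emr_args(args)
  bad = args.find { |a| a.include?(",") }
  raise ArgumentError, "argument contains a comma: #{bad}" if bad
  args.join(",")
end

puts to_emr_args(%w[-D some=1 -D args=1 blah blah])
# prints -D,some=1,-D,args=1,blah,blah
```

Rejecting embedded commas up front is a deliberate choice here: a comma inside an argument would otherwise be split into two arguments without any warning.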

I'd only explore the --json stuff if you need more complex steps; it's a bit cryptic and hard to get right the first time...




Answer 2:


To run a streaming job on EMR, first create a cluster with a command like:

ruby elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 6 --name "Some Job Cluster" --bootstrap-action s3://<path-to-a-bootstrap-script>

This returns a job flow ID, which looks something like: j-ABCD7EF763
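
If you script this from Windows, you will probably want to capture that ID rather than copy it by hand. Here is a small Ruby sketch; extract_jobflow_id is a hypothetical helper, and the pattern assumes IDs always look like the example above ("j-" followed by uppercase letters and digits).

```ruby
# Hypothetical helper: pull the job flow ID out of whatever text the
# --create command printed. Assumes IDs look like the example above
# ("j-" followed by uppercase letters and digits).
# Returns nil when no ID-like token is present.
def extract_jobflow_id(output)
  output[/\bj-[A-Z0-9]+\b/]
end

puts extract_jobflow_id("j-ABCD7EF763\n")
# prints j-ABCD7EF763
```

In a real script the input string would be the captured output of the create command, for example via Ruby's backtick operator. The nil return lets the caller stop before submitting steps to a cluster that was never created.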

Now you can submit your job step with the following command:

ruby elastic-mapreduce -j j-ABCD7EF763 --stream --step-name "my step name" --mapper s3://<some-path>/mapper-script.rb --reducer s3://<some-path>/reducer-script.rb --input s3://<input-path> --output s3://<output-path>
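
For reference, a streaming mapper or reducer is just a script that reads lines on standard input and writes "key<TAB>value" lines on standard output. The sketch below shows what the two scripts might contain for a word count. The word-count task and the function names are assumptions for illustration, not something the command above prescribes, and it assumes a Ruby interpreter is available on the cluster nodes.

```ruby
require "stringio"

# Illustrative word-count logic for a Hadoop streaming job.

# Mapper: emit "word<TAB>1" for every word on every input line.
def run_mapper(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: sum the counts per word. Streaming delivers reducer input
# sorted by key, so all lines for a given word arrive together.
def run_reducer(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    word, count = line.chomp.split("\t", 2)
    if word != current
      output.puts "#{current}\t#{total}" unless current.nil?
      current = word
      total = 0
    end
    total += count.to_i
  end
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Local dry run with in-memory IO; no cluster involved.
# The sort stands in for the shuffle phase between map and reduce.
mapped = StringIO.new
run_mapper(StringIO.new("a b a\n"), mapped)
reduced = StringIO.new
run_reducer(StringIO.new(mapped.string.lines.sort.join), reduced)
puts reduced.string
```

When deployed, each function would live in its own file (mapper-script.rb, reducer-script.rb) and be called with $stdin and $stdout as the last line. Taking the IO objects as parameters is what makes the dry run above possible before paying for a cluster.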

You can also run a job directly instead of submitting a streaming step to a cluster that is kept alive, in which case the cluster terminates itself when the job ends.




Answer 3:


Try using the --json option.

e.g. ./elastic-mapreduce --create --name Multisteps --json wordcount_jobflow.json

You will need to trim your JSON file down to only the Steps array (removing everything outside the []). There is a thread discussing that: https://forums.aws.amazon.com/thread.jspa?threadID=35093
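
The trimming can be done by hand, but here is a Ruby sketch of it. The helper name steps_only is hypothetical, and the sample input is made up: its field names follow the RunJobFlow request layout and the S3 path is a placeholder, so treat both as assumptions.

```ruby
require "json"

# Hypothetical helper: keep only the "Steps" array of a RunJobFlow-style
# JSON document, i.e. drop everything outside the [].
def steps_only(jobflow_json)
  doc = JSON.parse(jobflow_json)
  steps = doc.is_a?(Hash) ? doc["Steps"] : nil
  raise ArgumentError, 'no "Steps" array found' unless steps.is_a?(Array)
  JSON.pretty_generate(steps)
end

# Made-up input for illustration only.
full = <<~JOBFLOW
  {
    "Name": "Multisteps",
    "Steps": [
      { "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": { "Jar": "s3://<some-path>/my.jar", "Args": ["-D", "some=1"] } }
    ]
  }
JOBFLOW

puts steps_only(full)
```

To produce the file used in the command above, write the result out with File.write("wordcount_jobflow.json", ...). Parsing and re-serializing rather than cutting text by hand also catches a malformed file before it reaches EMR.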



Source: https://stackoverflow.com/questions/9621579/how-to-run-a-mapreduce-job-on-amazons-elastic-mapreduce-emr-cluster-from-wind
