Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

Submitted by 孤街浪徒 on 2019-12-08 13:37:43

Question


I've figured out how to install Python packages (numpy and such) at the bootstrapping step using boto, as well as how to copy files from S3 to my EC2 instances, also with boto.

What I haven't figured out is how to distribute Python scripts (or any other file) from S3 buckets to each EMR instance using boto. Any pointers?
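For context, a minimal sketch (region, script path, and names are assumptions, not from the question) of the kind of boto bootstrap action referred to above:

from boto.emr import connect_to_region
from boto.emr.bootstrap_action import BootstrapAction

conn = connect_to_region('us-east-1')  # assumed region

# A bootstrap action is just a script in S3 that every node runs at startup,
# e.g. to pip-install numpy or pull files down from S3.
install_deps = BootstrapAction(
    name='Install dependencies',
    path='s3://yourBucket/bootstrap/install_deps.sh',  # hypothetical script
    bootstrap_action_args=[],
)
# Later passed to conn.run_jobflow(..., bootstrap_actions=[install_deps], ...)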


Answer 1:


If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.

This is what I do:

  1. Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
  2. Create an archive of this: cd required && tar czvf required.tgz *
  3. Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz
  4. Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required

The last step will ensure that your archive file containing Python code will be in the same directory format as in your local dev machine.
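If you would rather stay in boto than shell out to tar and s3cmd, steps 2 and 3 can also be done in Python; a minimal sketch (the bucket name is a placeholder):

import tarfile
import boto

# Step 2: archive the contents of "required/" (equivalent to
# `cd required && tar czvf required.tgz *`).
with tarfile.open('required.tgz', 'w:gz') as tar:
    tar.add('required', arcname='.')

# Step 3: upload the archive to S3.
conn = boto.connect_s3()
bucket = conn.get_bucket('yourBucket')
key = bucket.new_key('required.tgz')
key.set_contents_from_filename('required.tgz')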

To actually do step #4 in boto, here is the code:

from boto.emr.step import StreamingStep

step = StreamingStep(name=jobName,
  mapper='...',    # S3 path or command for your mapper
  reducer='...',   # S3 path or command for your reducer
  # ... other step arguments (input, output, etc.)
  cache_archives=["s3://yourBucket/required.tgz#required"],
)
conn.add_jobflow_steps(jobID, [step])
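If you are launching a new cluster rather than adding a step to a running one, the same step object can be passed straight to run_jobflow; a rough sketch (instance types, counts, and S3 paths are assumptions):

jobID = conn.run_jobflow(
    name='my-streaming-job',           # assumed job name
    log_uri='s3://yourBucket/logs/',   # assumed log location
    steps=[step],
    num_instances=3,
    master_instance_type='m1.medium',
    slave_instance_type='m1.medium',
)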

And to allow the imported Python code to work properly in your mapper, make sure to reference it as you would a sub-directory:

import sys

sys.path.append('./required')
import myCustomPythonClass

# Mapper: do something!
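
Putting that together, a minimal streaming-mapper sketch (the module and its process() helper are hypothetical stand-ins for whatever lives in required.tgz):

#!/usr/bin/env python
import sys

# The cached archive is unpacked into ./required next to the mapper at runtime.
sys.path.append('./required')
import myCustomPythonClass  # shipped inside required.tgz

# Hadoop Streaming feeds input records on stdin and expects
# tab-separated key/value pairs on stdout.
for line in sys.stdin:
    key, value = myCustomPythonClass.process(line.strip())  # hypothetical helper
    print('%s\t%s' % (key, value))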


Source: https://stackoverflow.com/questions/18302759/copying-using-python-files-from-s3-to-amazon-elastic-mapreduce-at-bootstrap-time
