How to install custom packages on Amazon EMR with a bootstrap action, in code?

Posted by 我只是一个虾纸丫 on 2019-12-07 05:19:34

Question


I need to install some packages and binaries in an Amazon EMR bootstrap action, but I can't find any example that does this.

Basically, I want to install a Python package and have each Hadoop node use it to process the items in an S3 bucket. Here's a sample from boto:

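from boto.emr.step import StreamingStep

# Streaming step whose mapper script needs SimpleCV on every node
step = StreamingStep(name='Image to grayscale using SimpleCV python package',
                     mapper='s3n://elasticmapreduce/samples/imageGrayScale.py',
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/input',
                     output='s3n://<my output bucket>/output')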

I need to make it use the SimpleCV Python package, but I'm not sure where to specify this. If it is not installed, how do I get it installed? Is there a way to avoid waiting for the installation to complete? Could I install it somewhere once and just reference the Python package?


Answer 1:


There is a class boto.emr.bootstrap_action.BootstrapAction for the bootstrap action.

Define it as shown below. Most of the code is from the boto example page.

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction

# The script at `path` runs on every node before Hadoop starts.
# bootstrap_action_args is the list of arguments passed to the script (none here).
action = BootstrapAction(name="Bootstrap to add SimpleCV",
                         path="s3n://<my bucket uri>/bootstrap-simplecv.sh",
                         bootstrap_action_args=[])

conn = boto.emr.connect_to_region('us-west-2')
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         steps=[step],  # step defined elsewhere
                         bootstrap_actions=[action])

You also need to write the bootstrap script itself. If you need another version of Python then yes, it would save time to precompile it on a machine identical to the cluster nodes, tar it, put it in an S3 bucket, and then untar it during the bootstrap.

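#!/bin/sh
# filename: bootstrap-simplecv.sh  (save it in an S3 bucket)
# Assumes a Debian-based EMR AMI; Amazon Linux AMIs use yum instead of apt-get.
set -e -x   # stop on the first error and echo each command to the bootstrap log

# -y answers prompts automatically, since bootstrap actions run non-interactively
sudo apt-get install -y python-setuptools
sudo easy_install pip
sudo pip install -U SimpleCV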
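
For the precompiled route, the "tar it, put it in an S3 bucket" part might look roughly like the sketch below. The helper names (make_tarball, upload_tarball) and the paths in the usage comment are hypothetical, and `bucket` is assumed to be a boto 2 S3 bucket object, e.g. from boto.connect_s3().get_bucket(...). The bootstrap script would then have to download and untar the archive on each node.

```python
import os
import tarfile


def make_tarball(src_dir, out_path):
    """Pack src_dir into a gzip-compressed tarball and return its member names."""
    with tarfile.open(out_path, 'w:gz') as tar:
        # arcname keeps paths relative, so the archive unpacks under one folder
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    with tarfile.open(out_path, 'r:gz') as tar:
        return tar.getnames()


def upload_tarball(bucket, out_path, key_name):
    """Upload the tarball; `bucket` is assumed to be a boto.s3.bucket.Bucket."""
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(out_path)
    return key


# Hypothetical usage (paths and bucket name are placeholders):
# import boto
# bucket = boto.connect_s3().get_bucket('<my bucket>')
# make_tarball('/opt/python-custom', '/tmp/python-custom.tar.gz')
# upload_tarball(bucket, '/tmp/python-custom.tar.gz', 'python-custom.tar.gz')
```

Build the tarball on the same AMI the cluster uses; binaries compiled elsewhere may not run on the nodes.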

I think you can keep the EMR instances running from within boto, so that the bootstrap runs only once, when the cluster first starts in your session. Just be careful to terminate them before you log out so you don't get a surprise on your bill.
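
A minimal sketch of that idea, assuming boto 2's keep_alive flag on run_jobflow together with add_jobflow_steps and terminate_jobflow on the same connection. The helper name kept_alive_jobflow is hypothetical; `conn`, `action`, and `step` are the objects defined earlier.

```python
from contextlib import contextmanager


@contextmanager
def kept_alive_jobflow(conn, action, log_uri):
    """Start a job flow that stays up between steps, and terminate it on exit."""
    jobflow_id = conn.run_jobflow(name='My long-running jobflow',
                                  log_uri=log_uri,
                                  steps=[],
                                  bootstrap_actions=[action],
                                  keep_alive=True)  # do not shut down when idle
    try:
        yield jobflow_id
    finally:
        # Runs even if the body raises, so the cluster is not left running
        conn.terminate_jobflow(jobflow_id)


# Hypothetical usage:
# with kept_alive_jobflow(conn, action, 's3://<my log uri>/jobflow_logs') as jobflow_id:
#     conn.add_jobflow_steps(jobflow_id, [step])
#     # wait for the step to finish before leaving the block,
#     # because leaving it terminates the cluster
```

The bootstrap action runs once when the nodes start, and later steps added to the same job flow reuse the installed package.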



Source: https://stackoverflow.com/questions/23168663/how-to-install-custom-packages-on-amazon-emr-bootstrap-action-in-code
