How to bootstrap installation of Python modules on Amazon EMR?

Backend · Unresolved · 4 answers · 1086 views

孤独总比滥情好 2020-12-01 07:11

I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?

4 Answers
  •  借酒劲吻你
    2020-12-01 08:03

    In short, there are two ways to install packages with pip, depending on the EMR release. In both cases you first install whatever you need, then run your Spark step. The easiest is on emr-4.0.0 (and later releases), using 'command-runner.jar':

    from boto.emr.connection import EmrConnection
    from boto.emr.step import JarStep

    conn = EmrConnection()  # picks up AWS credentials from the environment

    # Step 1: install the Python package with pip (runs on the master node).
    pip_step = JarStep(name="Command Runner",
                       jar="command-runner.jar",
                       action_on_failure="CONTINUE",
                       step_args=["sudo", "pip", "install", "arrow"])

    # Step 2: run the Spark script once the dependency is installed.
    spark_step = JarStep(name="Spark with Command Runner",
                         jar="command-runner.jar",
                         action_on_failure="CONTINUE",
                         step_args=["spark-submit",
                                    "/usr/lib/spark/examples/src/main/python/pi.py"])

    # emr is the job flow object for the running cluster.
    step_list = conn.add_jobflow_steps(emr.jobflowid, [pip_step, spark_step])
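
    If the cluster was launched from the console instead, a minimal sketch of wiring up the connection and cluster ID might look like this; the region name and the j-XXXXXXXXXXXXX ID are placeholders you would replace with your own:

    import boto.emr

    # Assumed region; use the region your cluster actually runs in.
    conn = boto.emr.connect_to_region("us-east-1")

    # The cluster (job flow) ID is shown on the EMR console's cluster details page.
    step_list = conn.add_jobflow_steps("j-XXXXXXXXXXXXX", [pip_step, spark_step])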
    

    On AMI 2.x and 3.x, you use script-runner.jar in a similar fashion, except that you have to specify the full S3 URI for script-runner.jar.
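
    For reference, a rough sketch of that 2.x/3.x variant. The s3://mybucket/install-arrow.sh script is an assumed helper you would upload yourself (containing, say, `sudo pip install arrow`), and the us-east-1 bucket in the script-runner URI is an assumption; use the bucket for your cluster's region:

    from boto.emr.step import JarStep

    # On AMI 2.x/3.x, script-runner.jar must be referenced by its full S3 URI.
    # It downloads and runs the script given as its first argument.
    pip_step_legacy = JarStep(
        name="Install arrow (script-runner)",
        jar="s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
        action_on_failure="CONTINUE",
        step_args=["s3://mybucket/install-arrow.sh"],  # assumed bucket and script name
    )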

    EDIT: Sorry, I didn't see that you wanted to do this through the console. You can add the same steps there as well: the first step is a Custom JAR step with the same arguments as above, and the second is a Spark step. Hope this helps!
