I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
The pip install command differs depending on whether you are using Python 2 (the default on EMR) or Python 3. As recommended in noli's answer, create a shell script, upload it to a bucket in S3, and use it as a bootstrap action.
For Python 2 (the default for the pyspark kernel in Jupyter):
#!/bin/bash -xe
# -x traces each command, -e aborts the bootstrap on the first error
sudo pip install your_package
For Python 3 (the default for the Python 3 and pyspark3 kernels in Jupyter):
#!/bin/bash -xe
# pip-3.4 matches the Python 3.4 shipped with older EMR releases;
# newer releases may ship pip-3.6 or pip3 instead
sudo pip-3.4 install your_package
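As a minimal sketch of wiring this up from the AWS CLI (the bucket name, script name, and cluster parameters below are placeholders, not values from the question), you could upload the script and attach it as a bootstrap action when creating the cluster:

# Upload the bootstrap script to S3 (bucket and key are placeholders)
aws s3 cp install_python_deps.sh s3://your-bucket/bootstrap/install_python_deps.sh

# Create the cluster with the script attached as a bootstrap action
aws emr create-cluster \
    --name "spark-with-python-deps" \
    --release-label emr-5.2.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --bootstrap-actions Path="s3://your-bucket/bootstrap/install_python_deps.sh",Name="Install Python packages"

The same script can also be attached from the EMR console under the bootstrap actions step when creating the cluster; either way it runs on every node before applications start.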