Question
I'm spinning up an EMR cluster and I've created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I specify the script name as s3://myclusterbucket/scripts/script.py.
Is the output not automatically uploaded to S3? And how are dependencies handled? I've tried using --py-files pointing to a dependency zip inside the S3 bucket, but I keep getting 'file not found'.
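(For reference, a sketch of how such a spark-submit step can be added with boto3; the cluster id and deps.zip are placeholder names, not values from the actual setup.)

import boto3

emr = boto3.client('emr')

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',          # placeholder cluster id
    Steps=[{
        'Name': 'run script.py',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',  # runs the spark-submit command on the cluster
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--py-files', 's3://myclusterbucket/scripts/deps.zip',  # hypothetical dependency zip
                's3://myclusterbucket/scripts/script.py',
            ],
        },
    }],
)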
Answer 1:
MapReduce or Tez jobs in EMR can access S3 directly thanks to EMRFS (an AWS-proprietary Hadoop filesystem implementation backed by S3), e.g., in Apache Pig you can do
loaded_data = LOAD 's3://mybucket/myfile.txt' USING PigStorage();
I'm not sure about Python-based Spark jobs, but one solution is to first copy the objects from S3 to the cluster's HDFS and then process them there.
There are multiple ways of doing the copy:
Use hadoop fs commands to copy objects from S3 to the EMR HDFS (and vice versa), e.g.,
hadoop fs -cp s3://mybucket/myobject hdfs://mypath_on_emr_hdfs
Use s3-dist-cp to copy objects from S3 to the EMR HDFS (and vice versa): http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
You can also use awscli (or hadoop fs -copyToLocal) to copy objects from S3 to the EMR master instance's local disk (and vice versa), e.g.,
aws s3 cp s3://mybucket/myobject .
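Once the data is on HDFS, the Python-based Spark job can read it from the hdfs:// path and write its results back to HDFS; output is not uploaded to S3 automatically, so copy the result files back to S3 afterwards with any of the tools above. A minimal PySpark sketch, with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("process-copied-data").getOrCreate()

# Read the objects that were copied from S3 onto the cluster's HDFS
# (the path is a hypothetical placeholder).
df = spark.read.text("hdfs:///mypath_on_emr_hdfs/myobject")

# ... transform df here ...

# Write the results to HDFS; copy them back to S3 afterwards with
# hadoop fs, s3-dist-cp, or awscli as described above.
df.write.mode("overwrite").text("hdfs:///output/myresult")

spark.stop()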
Source: https://stackoverflow.com/questions/47211002/how-does-emr-handle-an-s3-bucket-for-input-and-output