Question
I'm spinning up an EMR cluster and I've created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I specify the script name as s3://myclusterbucket/scripts/script.py.
Is the output not automatically uploaded to S3? And how are dependencies handled? I've tried using --py-files pointing to a dependency zip inside the S3 bucket, but I keep getting 'file not found'.
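(For reference, a sketch of how such a spark-submit step can be added with boto3; the cluster id and deps.zip are placeholder names, not values from the actual setup.)

import boto3

emr = boto3.client('emr')

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',          # placeholder cluster id
    Steps=[{
        'Name': 'run script.py',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',  # runs the spark-submit command on the cluster
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--py-files', 's3://myclusterbucket/scripts/deps.zip',  # hypothetical dependency zip
                's3://myclusterbucket/scripts/script.py',
            ],
        },
    }],
)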
Answer 1:
MapReduce or Tez jobs in EMR can access S3 directly thanks to EMRFS (an AWS-proprietary Hadoop filesystem implementation backed by S3), e.g., in Apache Pig you can do
loaded_data = LOAD 's3://mybucket/myfile.txt' USING PigStorage();
I'm not sure about Python-based Spark jobs, but one solution is to first copy the objects from S3 to the cluster's HDFS and then process them there.
There are multiple ways of doing the copy:
Use hadoop fs commands to copy objects from S3 to the EMR HDFS (and vice versa), e.g.,
hadoop fs -cp s3://mybucket/myobject hdfs://mypath_on_emr_hdfs
Use s3-dist-cp to copy objects from S3 to the EMR HDFS (and vice versa): http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
You can also use awscli (or hadoop fs -copyToLocal) to copy objects from S3 to the EMR master instance's local disk (and vice versa), e.g.,
aws s3 cp s3://mybucket/myobject .
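Once the data is on HDFS, the Python-based Spark job can read it from the hdfs:// path and write its results back to HDFS; output is not uploaded to S3 automatically, so copy the result files back to S3 afterwards with any of the tools above. A minimal PySpark sketch, with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("process-copied-data").getOrCreate()

# Read the objects that were copied from S3 onto the cluster's HDFS
# (the path is a hypothetical placeholder).
df = spark.read.text("hdfs:///mypath_on_emr_hdfs/myobject")

# ... transform df here ...

# Write the results to HDFS; copy them back to S3 afterwards with
# hadoop fs, s3-dist-cp, or awscli as described above.
df.write.mode("overwrite").text("hdfs:///output/myresult")

spark.stop()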
Source: https://stackoverflow.com/questions/47211002/how-does-emr-handle-an-s3-bucket-for-input-and-output