Copying files from Amazon S3 to HDFS using s3distcp fails

独自空忆成欢 posted on 2019-12-06 04:00:01

Question


I am trying to copy files from S3 to HDFS using a workflow in EMR. When I run the command below, the jobflow starts successfully but gives me an error when it tries to copy the file to HDFS. Do I need to set any input file permissions?

Command:

./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users

Output

Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to copy 1 files: s3://odsh/input/GL_01112_20121019.dat etc
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:538)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


Answer 1:


I'm getting the same exception. It looks like the bug is caused by a race condition when CopyFilesReducer uses multiple CopyFilesRunable instances to download the files from S3. The problem is that it uses the same temp directory in multiple threads, and each thread deletes the temp directory when it is done. Hence, when one thread completes before another, it deletes the temp directory that the other thread is still using.

I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread: set the variable s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.
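
To make the race described above concrete, here is a toy shell sketch. It is not s3distcp code: the worker function, the worker names, and the delays are all made up for illustration, and it assumes only the behaviour described in this answer (several workers share one temp directory and each deletes it when it finishes).

```shell
#!/bin/sh
# Toy model of the suspected race; hypothetical names, NOT s3distcp code.
# Two workers share one temp directory, and each deletes it when done.
shared_tmp=$(mktemp -d)

worker() {
    name=$1
    delay=$2
    sleep "$delay"                      # simulate download time
    # The write fails if another worker has already removed the directory.
    if { echo data > "$shared_tmp/$name.part"; } 2>/dev/null; then
        echo "$name: copied"
    else
        echo "$name: failed, temp directory is gone"
    fi
    rm -rf "$shared_tmp"                # every worker cleans up the shared directory
}

# Run both workers concurrently and collect their messages.
result=$( { worker fast 0 & worker slow 1 & wait; } 2>&1 | sort )
echo "$result"
```

The fast worker finishes first and removes the directory, so the slow worker's write fails. With a single worker there is nobody else to delete the directory, which is why limiting the reducer to one thread would sidestep the problem.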




Answer 2:


I see the same problem, caused by the race condition. Passing -Ds3DistCp.copyfiles.mapper.numWorkers=1 helps avoid it.

I hope Amazon fixes this bug.
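
A rough sketch of where that flag might go when s3distcp is launched with hadoop jar. The jar path, source, and destination are placeholders (the defaults are borrowed from the example usage elsewhere on this page, not verified for your cluster), and the placement of the -D option right after the jar follows that same example. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: jar path and S3/HDFS locations are placeholder assumptions.
S3DISTCP_JAR=${S3DISTCP_JAR:-/home/hadoop/lib/emr-s3distcp-1.0.jar}
SRC=${SRC:-s3://source/}
DEST=${DEST:-hdfs:///dest/}

# The -D option is placed right after the jar, before the s3distcp arguments.
cmd="hadoop jar $S3DISTCP_JAR -Ds3DistCp.copyfiles.mapper.numWorkers=1 --src $SRC --dest $DEST"

# Print the command by default; set RUN=1 to actually execute it on the cluster.
if [ "${RUN:-0}" = "1" ]; then
    $cmd
else
    echo "$cmd"
fi
```

Override SRC, DEST, and S3DISTCP_JAR in the environment to match your own bucket, target directory, and cluster layout.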




Answer 3:


Adjusting the number of workers didn't work for me; s3distcp always failed on a small/medium instance. Increasing the heap size of the task JVMs (via -D mapred.child.java.opts=-Xmx1024m) solved it for me.

Example usage:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -D mapred.child.java.opts=-Xmx1024m \
    --src s3://source/ \
    --dest hdfs:///dest/ --targetSize 128 \
    --groupBy '.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*' \
    --outputCodec gzip



Answer 4:


The problem is that the MapReduce job fails. The mappers execute perfectly, but the reducers create a bottleneck in the cluster's memory.

This solved it for me: -Dmapreduce.job.reduces=30. If it still fails, try reducing it to 20, i.e. -Dmapreduce.job.reduces=20

I'll add the entire argument for ease of understanding:

In AWS Cluster:

JAR location : command-runner.jar

Main class : None

Arguments : s3-dist-cp -Dmapreduce.job.reduces=30 --src=hdfs:///user/ec2-user/riskmodel-output --dest=s3://dev-quant-risk-model/2019_03_30_SOM_EZ_23Factors_Constrained_CSR_Stats/output --multipartUploadChunkSize=1000

Action on failure: Continue

In a script file:

aws --profile $AWS_PROFILE emr add-steps --cluster-id $CLUSTER_ID --steps Type=CUSTOM_JAR,Jar='command-runner.jar',Name="Copy Model Output To S3",ActionOnFailure=CONTINUE,Args=[s3-dist-cp,-Dmapreduce.job.reduces=20,--src=$OUTPUT_BUCKET,--dest=$S3_OUTPUT_LARGEBUCKET,--multipartUploadChunkSize=1000]



Source: https://stackoverflow.com/questions/14631152/copy-files-from-amazon-s3-to-hdfs-using-s3distcp-fails
