s3-dist-cp and hadoop distcp job infinitely looping in EMR

喜夏-厌秋 submitted on 2019-12-25 07:44:58

Question


I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:

s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/

hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/

I'm running these on the master node and also keeping an eye on the amount of data transferred. The copy takes about an hour, and then everything that was copied gets erased, disk space shows as 99.8% used on the 4 core instances in my cluster, and the Hadoop job runs forever. As soon as I run the command,

16/07/18 18:43:55 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:44:02 INFO mapreduce.Job: map 100% reduce 0%
16/07/18 18:44:08 INFO mapreduce.Job: map 100% reduce 14%
16/07/18 18:44:11 INFO mapreduce.Job: map 100% reduce 29%
16/07/18 18:44:13 INFO mapreduce.Job: map 100% reduce 86%
16/07/18 18:44:18 INFO mapreduce.Job: map 100% reduce 100%

These lines get printed immediately, then data is copied over for an hour, and then the whole cycle starts again:

16/07/18 19:52:45 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:52:53 INFO mapreduce.Job: map 100% reduce 0%

Am I missing anything here? Any help is appreciated.

Also, I would like to know where I can find the log files on the master node, to see whether the job is failing and therefore looping. Thanks.
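For reference, EMR's usual log locations on the master node can be probed like this (a sketch; the paths are common EMR defaults and may differ by EMR release):

```shell
# Look for step logs and YARN logs in the default EMR locations, if present:
for d in /mnt/var/log/hadoop/steps /var/log/hadoop-yarn; do
  if [ -d "$d" ]; then
    ls "$d"                             # step logs / YARN daemon logs
  else
    echo "not on this machine: $d"      # path absent on this node/release
  fi
done
# Per-job container logs (requires YARN log aggregation to be enabled):
#   yarn logs -applicationId <application id>
```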


Answer 1:


In my case, I copy a single large compressed file from HDFS to S3, and hadoop distcp is much faster than s3-dist-cp.

When I check the logs, the multipart-upload step takes a very long time during the reduce phase: uploading one 134 MB block takes 20 seconds with s3-dist-cp, but only 4 seconds with hadoop distcp.

The difference between distcp and s3-dist-cp is that distcp creates its temporary files on S3 (the destination file system), while s3-dist-cp creates its temporary files on HDFS.

I am still investigating why multipart-upload performance differs so much between distcp and s3-dist-cp; I hope someone with good insight can contribute here.




Answer 2:


If you can pick up Hadoop 2.8.0 for your investigation and use the s3a:// filesystem, you can grab the many filesystem statistics it now collects.

A real performance killer is rename(), which the S3 clients mimic by doing a copy followed by a delete: if either distcp run is trying to do an atomic commit via renames, that adds a delay of roughly 1 second for every 6-10 MB of data. A 134 MB block incurring ~16 s of post-upload delay is consistent with "it's a rename".
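That estimate is easy to sanity-check with back-of-the-envelope arithmetic (the 6-10 MB/s copy rate is the answer's own assumption, not a measured value):

```shell
# Copy-based rename at ~6-10 MB/s: how long should a 134 MB block take?
block_mb=134
fast=$((block_mb / 10))   # best case, ~10 MB/s
slow=$((block_mb / 6))    # worst case, ~6 MB/s
echo "expected rename delay: ${fast}-${slow} s"   # brackets the observed ~16 s gap
```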



Source: https://stackoverflow.com/questions/38462480/s3-dist-cp-and-hadoop-distcp-job-infinitely-loopin-in-emr
