Multiple source files for s3distcp


Question


Is there a way to use s3distcp to copy a list of files from S3 to HDFS instead of a complete folder? This is for cases where srcPattern cannot work.

I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory, but I could not find a way to specify multiple source file paths to s3distcp.

The workaround I am currently using is to list all the file names in srcPattern:

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --srcPattern '.*somefile.*|.*anotherone.*'

Can this approach work when the number of files is very large, say around 10,000?


Answer 1:


Yes, you can. Create a manifest file listing all the files you need and use the --copyFromManifest option, as mentioned here.
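For reference, a minimal sketch of that approach. The bucket, file names, and manifest location below are placeholders, and the per-line JSON fields are based on what --outputManifest typically emits; the safest way to get the exact format right is to run one copy with --outputManifest and reuse the resulting file as a template.

# Hypothetical manifest: one JSON entry per file to copy, gzip-compressed
cat > manifest.json <<'EOF'
{"path":"s3n://bucket/src_folder/somefile.txt","baseName":"somefile.txt","srcDir":"s3n://bucket/src_folder","size":1024}
{"path":"s3n://bucket/src_folder/anotherone.txt","baseName":"anotherone.txt","srcDir":"s3n://bucket/src_folder","size":2048}
EOF
gzip manifest.json    # produces manifest.json.gz

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --copyFromManifest \
    --previousManifest=file:///path/to/manifest.json.gz

With --copyFromManifest, s3distcp copies only the files listed in the manifest passed via --previousManifest rather than everything under --src, which avoids building a huge srcPattern.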




Answer 2:


hadoop distcp should solve your problem. We can use distcp to copy data from S3 to HDFS.

It also supports wildcards, and we can provide multiple source paths in the command.

http://hadoop.apache.org/docs/r1.2.1/distcp.html

Go through the usage section at the URL above.

Example: suppose you have the following files in an S3 bucket (test-bucket) inside the test1 folder:

abc.txt
abd.txt
defg.txt

And inside the test2 folder you have:

hijk.txt
hjikl.txt
xyz.txt

And your HDFS path is hdfs://localhost.localdomain:9000/user/test/.

Then the distcp command for a particular pattern is as follows:

hadoop distcp \
    s3n://test-bucket/test1/ab*.txt \
    s3n://test-bucket/test2/hi*.txt \
    hdfs://localhost.localdomain:9000/user/test/
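If there are too many files to list or glob on the command line (the question mentions around 10,000), distcp can also read its sources from a listing file via the -f option. A sketch, assuming a hypothetical srclist.txt with one source URI per line:

# srclist.txt contains lines such as:
#   s3n://test-bucket/test1/abc.txt
#   s3n://test-bucket/test2/xyz.txt
hadoop fs -put srclist.txt /tmp/srclist.txt
hadoop distcp -f hdfs:///tmp/srclist.txt hdfs://localhost.localdomain:9000/user/test/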


Source: https://stackoverflow.com/questions/26273181/multiple-source-files-for-s3distcp
