Deduce the HDFS path at runtime on EMR

Submitted by 自闭症网瘾萝莉.ら on 2019-12-12 04:36:20

Question


I have spawned an EMR cluster with EMR steps that use s3-dist-cp to copy a file from S3 to HDFS and vice versa. The cluster is on-demand, so we do not keep track of its IP address.
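For context, such steps can be submitted to the running cluster through the EMR step API rather than over SSH, so nothing needs to know a node's IP. A minimal sketch using the AWS CLI and command-runner.jar, mirroring the two steps described below (the cluster id is a placeholder):

    aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
      'Type=CUSTOM_JAR,Name=MkdirInput,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[hadoop,fs,-mkdir,/input]' \
      'Type=CUSTOM_JAR,Name=S3ToHdfs,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--s3Endpoint=s3.amazonaws.com,--src=s3://<bucket-name>/<folder-name>/sample.txt,--dest=hdfs:///input]'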

The first EMR step is: hadoop fs -mkdir /input. This step completed successfully.

The second EMR step runs the following command:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input

This step FAILED.

It fails with the following error:

Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: sample.txt
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:213)
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:28)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.UnknownHostException: sample.txt

But this file does exist on S3, and I can read it through my Spark application on EMR.


Answer 1:


The solution: when using s3-dist-cp, the file name should not appear in the source or the destination path; both should be directories. (Judging from the stack trace, the file name ends up being parsed as the host portion of an hdfs:// URI, hence UnknownHostException: sample.txt.)

If you want to copy only certain files from the source directory, use the --srcPattern option, for example:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/ --dest=hdfs:///input/ --srcPattern=.*sample\.txt.*

Note that --srcPattern takes a regular expression matched against the full S3 path, so a leading .* is generally needed to match the bucket/folder prefix.
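To sanity-check the result, and for the HDFS-to-S3 direction the question mentions, something like the following should work (run on the cluster or as further steps; the output folder name is a made-up placeholder):

    # Confirm the file landed in HDFS
    hadoop fs -ls /input

    # The reverse copy follows the same rule: directories on both sides,
    # --srcPattern to select individual files
    s3-dist-cp --src=hdfs:///input/ --dest=s3://<bucket-name>/<output-folder>/ --srcPattern=.*sample\.txt.*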



Source: https://stackoverflow.com/questions/43548832/deduce-the-hdfs-path-at-runtime-on-emr
