list S3 folder on EMR

Posted by 不想你离开。 on 2019-12-10 10:42:13

Question


I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job. I wanted to do the following:

Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false))

This always fails with the following error

java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020

In the hadoopConfiguration, fs.defaultFS is set to hdfs://**********.eu-central-1.compute.internal:8020

The way I understand it, if I don't use a protocol, e.g. just /myfolder/myfile instead of hdfs://myfolder/myfile, the path will default to fs.defaultFS. But I would expect that if I specify s3://mybucket/, fs.defaultFS should not matter.

How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine, but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the correct tool.

PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001. There is quite a long delay before they show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs? UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers; however, the directory is owned by yarn:yarn and as the hadoop user I cannot read it. :( Ideas?


Answer 1:


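I don't think you are picking up the FS right. FileSystem.get(conf) returns the filesystem for fs.defaultFS (HDFS here), which then rejects an s3:// path. Use the static FileSystem.get(URI, Configuration) method, or Path.getFileSystem(Configuration), so that the filesystem is chosen from the path itself.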

Try something like:

Path p = new Path("s3://bucket/subdir");
FileSystem fs = p.getFileSystem(conf); // resolved from the path's scheme, not from fs.defaultFS
FileStatus[] status = fs.listStatus(p);
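To illustrate why the scheme of the path matters, here is a minimal, self-contained sketch that uses only java.net.URI. It is not Hadoop's code: the class WrongFsSketch, the helpers resolveFs and checkPath, and the host namenode:8020 are all hypothetical names made up for the illustration, and the check is simplified to compare schemes only (Hadoop also compares the authority). It shows the two behaviours discussed above: a path without a scheme falls back to the default filesystem, while handing an s3:// path to a filesystem bound to hdfs:// produces a "Wrong FS" style error.

```java
import java.net.URI;

class WrongFsSketch {
    // Hypothetical helper: choose the filesystem URI for a path.
    // A path without a scheme falls back to the default filesystem.
    static URI resolveFs(URI defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() == null) {
            return defaultFs;
        }
        String authority = p.getAuthority() == null ? "" : p.getAuthority();
        return URI.create(p.getScheme() + "://" + authority + "/");
    }

    // Hypothetical helper: simplified model of the check that rejects a path
    // whose scheme does not match the filesystem it was handed to.
    static void checkPath(URI fsUri, String path) {
        URI p = URI.create(path);
        if (p.getScheme() == null) {
            return; // no scheme: the path is interpreted relative to this filesystem
        }
        if (!p.getScheme().equalsIgnoreCase(fsUri.getScheme())) {
            throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    public static void main(String[] args) {
        // Assumed default filesystem, standing in for fs.defaultFS on the cluster.
        URI defaultFs = URI.create("hdfs://namenode:8020");

        System.out.println(resolveFs(defaultFs, "/myfolder/myfile"));   // hdfs://namenode:8020
        System.out.println(resolveFs(defaultFs, "s3://mybucket/data")); // s3://mybucket/

        try {
            // Mirrors calling listFiles on the default (HDFS) filesystem with an s3:// path.
            checkPath(defaultFs, "s3://mybucket/data");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the real Hadoop API, the FileSystem obtained from the path also provides exists(Path) and delete(Path, boolean), which cover the existence checks and deletions mentioned in the question.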

Regarding logs, the YARN UI should let you get at them via the node managers.



Source: https://stackoverflow.com/questions/43980302/list-s3-folder-on-emr
