Get a list of subdirectories

Submitted by 大城市里の小女人 on 2019-12-12 03:47:26

Question


I know I can do this:

data = sc.textFile('/hadoop_foo/a')
data.count()
240
data = sc.textFile('/hadoop_foo/*')
data.count()
168129

However, I would like to count the size of the data of every subdirectory of "/hadoop_foo/". Can I do that?

In other words, what I want is something like this:

subdirectories = magicFunction()
for subdir in subdirectories:
  data = sc.textFile(subdir)
  data.count()

I tried with:

In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
Out[9]: []

but I think that fails because it searches the local directory of the driver (the gateway in this case), while "/hadoop_foo/" lies in HDFS. The same happens for "hdfs:///hadoop_foo/".


After reading How can I list subdirectories recursively for HDFS?, I am wondering if there is a way to execute:

hadoop dfs -lsr /hadoop_foo/

from code.


From Correct way of writing two floats into a regular txt:

In [28]: os.getcwd()
Out[28]: '/homes/gsamaras'  <-- which is my local directory

Answer 1:


With Python, use the hdfs module; its walk() method can get you a list of files.

The code should look something like this:

from hdfs import InsecureClient

# connect to the WebHDFS endpoint of the NameNode
client = InsecureClient('http://host:port', user='user')

# walk() behaves like os.walk(): it yields (path, dirnames, filenames) tuples
for path, dirnames, filenames in client.walk('/hadoop_foo/', depth=0):
    ...

With Scala you can get the filesystem (val fs = FileSystem.get(new Configuration())) and call listFiles: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path, boolean)
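The same FileSystem API is also reachable from PySpark through the py4j JVM gateway, so you do not have to switch to Scala. Here is a minimal sketch, assuming a running SparkContext named sc and using listStatus() (rather than listFiles()) to discover the immediate subdirectories; note that sc._jvm and sc._jsc are internal attributes of PySpark:

# sketch: call Hadoop's FileSystem API from PySpark via the JVM gateway
Path = sc._jvm.org.apache.hadoop.fs.Path
FileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(sc._jsc.hadoopConfiguration())

# listStatus() returns FileStatus objects for the direct children of the path
for status in fs.listStatus(Path("/hadoop_foo/")):
    if status.isDirectory():
        subdir = status.getPath().toString()
        print(subdir, sc.textFile(subdir).count())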

You can also execute a shell command from your script with the subprocess module, but this is never a recommended approach, since you then depend on the text output of a shell utility.


Eventually, what worked for the OP was using subprocess.check_output():

subdirectories = subprocess.check_output(["hadoop","fs","-ls", "/hadoop_foo/"])
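The raw -ls output still has to be parsed before each subdirectory can be counted. A hedged sketch of one way to do that, assuming the usual hadoop fs -ls column layout where directory lines start with "d" and the path is the last whitespace-separated field:

import subprocess

# list the children of /hadoop_foo/ and keep only the directory entries
out = subprocess.check_output(["hadoop", "fs", "-ls", "/hadoop_foo/"]).decode()
subdirectories = [line.split()[-1] for line in out.splitlines() if line.startswith("d")]

# count the lines in each subdirectory, as the question asks
for subdir in subdirectories:
    print(subdir, sc.textFile(subdir).count())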


Source: https://stackoverflow.com/questions/39420685/get-a-list-of-subdirectories
