Pyspark: get list of files/directories on HDFS path

野趣味 · 2020-12-05 07:14

As per title. I'm aware of textFile but, as the name suggests, it works only on text files. I would need to access files/directories inside a path on either HDFS or a local path.

6 Answers
  •  一个人的身影 · 2020-12-05 08:01

    This might work for you:

    import subprocess, re

    def listdir(path):
        # Shell out to `hdfs dfs -ls` and keep the absolute-path column of
        # each entry; check_output returns bytes, so decode before splitting.
        out = subprocess.check_output(['hdfs', 'dfs', '-ls', path]).decode()
        return [re.search(r' (/.+)', line).group(1)
                for line in out.split('\n') if re.search(r' (/.+)', line)]

    listdir('/user/')
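
    If your Hadoop version supports the `-C` option of `hdfs dfs -ls` (print paths only, no permissions/size columns), the parsing step goes away entirely. A minimal sketch under that assumption; `listdir_paths` is just an illustrative name:

    import subprocess

    def listdir_paths(path):
        # `-C` makes `ls` emit one absolute path per line and nothing else.
        out = subprocess.check_output(['hdfs', 'dfs', '-ls', '-C', path])
        return out.decode().splitlines()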
    

    This also worked:

    # Reuse the JVM gateway of an existing SparkContext `sc` to call the
    # Hadoop FileSystem API directly, avoiding a subprocess entirely.
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem
    conf = hadoop.conf.Configuration()
    path = hadoop.fs.Path('/user/')
    [str(f.getPath()) for f in fs.get(conf).listStatus(path)]
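
    Since listStatus returns FileStatus objects, the same gateway handles can also separate files from directories. A minimal sketch, assuming `fs`, `conf`, and `path` from the snippet above (isFile and isDirectory are standard FileStatus methods):

    # Split the listing into directories and plain files.
    statuses = fs.get(conf).listStatus(path)
    dirs = [str(s.getPath()) for s in statuses if s.isDirectory()]
    files = [str(s.getPath()) for s in statuses if s.isFile()]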
    
