Pyspark: get list of files/directories on HDFS path

野趣味 2020-12-05 07:14

As per the title. I'm aware of textFile, but, as the name suggests, it works only on text files. I would need to access files/directories inside a path on either HDFS or a local path.

6 Answers
  •  臣服心动
    2020-12-05 07:40

    Using the JVM gateway may not be the most elegant approach, but in some cases the code below can be helpful:

    # Grab the Hadoop filesystem classes through the Py4J JVM gateway
    URI           = sc._gateway.jvm.java.net.URI
    Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    # Connect to the HDFS NameNode (replace host/port with your own)
    fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

    # listStatus returns a FileStatus entry for every file and sub-directory
    status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

    for fileStatus in status:
        print(fileStatus.getPath())
    
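    If you already have an active SparkContext, a variant of this (a minimal, untested sketch; the directory path is just a placeholder) reuses the context's Hadoop configuration instead of hard-coding the NameNode URI, so the default filesystem is picked up automatically:

    # Assumes an existing SparkContext `sc`; reuse its Hadoop configuration
    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path        = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem  = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

    fs = FileSystem.get(hadoop_conf)

    # listStatus returns FileStatus objects for both files and sub-directories
    for file_status in fs.listStatus(Path('/some_dir/')):  # placeholder path
        kind = 'dir' if file_status.isDirectory() else 'file'
        print(kind, file_status.getPath().toString())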
