Pyspark: get list of files/directories on HDFS path

野趣味 2020-12-05 07:14

As per title. I'm aware of textFile but, as the name suggests, it works only on text files. I would need to access files/directories inside a path on either HD

6 answers
  •  孤城傲影
    2020-12-05 07:47

    If you want to read in all files in a directory, check out sc.wholeTextFiles [doc], but note that it returns (path, content) pairs, so each file's entire contents end up in the value of a single record, which is probably not the desired result.
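To make that behavior concrete, here is a minimal sketch, assuming `sc` is an already-created SparkContext. The function names `files_in_directory` and `lines_in_directory` are made up for illustration. Since the keys of the pairs are the file paths, this also gives a rough way to list the files under a path, though Spark may still read every file's contents to do so, and subdirectories are not reported.

```python
def files_in_directory(sc, directory):
    """Return the paths of the files that wholeTextFiles finds under `directory`."""
    # wholeTextFiles yields (path, content) pairs; keep only the paths.
    return sc.wholeTextFiles(directory).keys().collect()


def lines_in_directory(sc, directory):
    """Return an RDD of individual lines instead of one record per file."""
    pairs = sc.wholeTextFiles(directory)
    # Each value holds a whole file, so split it back into lines.
    return pairs.flatMap(lambda path_and_content: path_and_content[1].splitlines())
```

Something like `files_in_directory(sc, "hdfs:///some/dir")` would then return a plain Python list of path strings (the path here is only a placeholder).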

    If you want to read only some files, the best approach seems to be to generate a list of paths (using a normal hdfs ls command plus whatever filtering you need), pass that list to sqlContext.read.text [doc], and then convert the resulting DataFrame to an RDD.
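A rough sketch of that second approach, assuming the `hdfs` command-line client is on the PATH of the driver machine and that `hdfs dfs -ls` prints its usual eight-column listing (permissions, replication, owner, group, size, date, time, path). The helper names `parse_ls_output`, `list_hdfs_paths` and `read_selected_files` are hypothetical, not part of Spark or Hadoop.

```python
import subprocess


def parse_ls_output(ls_output):
    """Turn the text printed by `hdfs dfs -ls` into (path, is_directory) tuples."""
    entries = []
    for line in ls_output.splitlines():
        # Limit the split so that paths containing spaces stay in one piece.
        parts = line.split(None, 7)
        # Skip the "Found N items" header and anything that is not an entry.
        if len(parts) != 8 or parts[0][0] not in "-d":
            continue
        entries.append((parts[7], parts[0].startswith("d")))
    return entries


def list_hdfs_paths(directory):
    """Run `hdfs dfs -ls` on `directory` and parse what it prints."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        capture_output=True, text=True, check=True,
    )
    return parse_ls_output(result.stdout)


def read_selected_files(sql_context, paths):
    """Read the given files as text and return an RDD of line strings."""
    # read.text accepts a list of paths; each row has a single "value" column.
    return sql_context.read.text(paths).rdd.map(lambda row: row.value)
```

Filtering is then ordinary Python, for example keeping only the non-directory entries whose path ends in `.txt` before handing the list to `read_selected_files`.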
