Pyspark: get list of files/directories on HDFS path

野趣味 2020-12-05 07:14

As per title. I'm aware of textFile but, as the name suggests, it works only on text files. I would need to access files/directories inside a path on either HDFS or a local path.

6 Answers
  •  孤城傲影
    2020-12-05 07:56

    I believe it's helpful to think of Spark only as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
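
    To illustrate what glob expressions buy you, here is a minimal sketch; the hdfs:///data/... paths and the app name are hypothetical placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="glob-example")

    # Hadoop glob syntax: read every part file under any 2020-* date
    # directory in a single call. The paths below are made up.
    rdd = sc.textFile("hdfs:///data/2020-*/part-*")
    print(rdd.count())

    Note this reads the matched files' contents; it still doesn't tell you what the directory tree looks like, which is where the tools below come in.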

    There are a few available tools to do what you want, including esutil and hdfs. The hdfs library offers both a CLI and a Python API, and its documentation covers listing HDFS files in Python directly. It looks like this:

    from hdfs import Config

    # Reads the 'dev' alias from the HdfsCLI config file (~/.hdfscli.cfg).
    client = Config().get_client('dev')

    # Returns the names of all entries directly under the given directory.
    files = client.list('the_dir_path')
    
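    If you also need to tell files apart from directories, the same client can return status metadata alongside each name. A small sketch, assuming the client configured above:

    # With status=True, list() yields (name, metadata) pairs; the 'type'
    # field of the WebHDFS FileStatus is 'FILE' or 'DIRECTORY'.
    for name, meta in client.list('the_dir_path', status=True):
        print(name, meta['type'])
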
