Reading multiple csv files at different folder depths


Question


I want to recursively read all csv files in a given folder into a Spark SQL DataFrame using a single path, if possible.

My folder structure looks something like this and I want to include all of the files with one path:

  1. resources/first.csv
  2. resources/subfolder/second.csv
  3. resources/subfolder/third.csv

This is my code:

def read: DataFrame =
      sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("charset", "UTF-8")
        .csv(path)

Setting path to .../resources/*/*.csv omits 1., while .../resources/*.csv omits 2. and 3.

I know csv() also takes multiple strings as path arguments, but I want to avoid that if possible.

Note: I know my question is similar to How to import multiple csv files in a single load?, except that I want to include the files from all contained folders, regardless of their location within the main folder.


Answer 1:


If there are only csv files and only one level of subfolders in your resources directory, then you can use resources/**.
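
A minimal sketch of that approach, reusing the reader settings from the question (the path here is assumed to be relative to the working directory):

    // the ** glob picks up files directly under resources/ as well as
    // its immediate subfolders, per the one-level caveat above
    val df = sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("charset", "UTF-8")
        .csv("resources/**")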

EDIT

Otherwise, you can use the Hadoop FileSystem class to recursively list every csv file in your resources directory and then pass the list to .csv():

    import scala.collection.mutable.ListBuffer

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // recursive = true makes listFiles walk every subdirectory of resources/
    val files = fs.listFiles(new Path("resources/"), true)
    val filePaths = new ListBuffer[String]
    while (files.hasNext()) {
        val file = files.next()
        // listFiles returns every file it finds, so keep only the csv files
        if (file.getPath.getName.endsWith(".csv"))
            filePaths += file.getPath.toString
    }

    val df: DataFrame = spark
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(filePaths: _*)
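
If you are on Spark 3.0 or later, the recursiveFileLookup option on file-based sources descends into all subdirectories for you, so the manual listing becomes unnecessary. A minimal sketch, assuming Spark 3.0+:

    // requires Spark 3.0 or later; recursiveFileLookup tells the reader
    // to walk every subdirectory under the given path
    val df = sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("recursiveFileLookup", "true")
        .csv("resources/")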


Source: https://stackoverflow.com/questions/43043797/reading-multiple-csv-files-at-different-folder-depths
