Scala & DataBricks: Getting a list of Files

Submitted by 久未见 on 2020-02-04 22:58:26

Question


I am trying to get a list of the files in an S3 bucket on Databricks in Scala, and then filter them with a regex. I am very new to Scala. The Python equivalent would be

all_files = map(lambda x: x.path, dbutils.fs.ls(folder))
filtered_files = filter(lambda name: pattern.match(name), all_files)

but I want to do this in Scala.

From https://alvinalexander.com/scala/how-to-list-files-in-directory-filter-names-scala

import java.io.File

// Note: java.io.File only sees the driver's local filesystem,
// so it cannot resolve s3:// paths
def getListOfFiles(dir: String): List[File] = {
    val d = new File(dir)
    if (d.exists && d.isDirectory) {
        d.listFiles.filter(_.isFile).toList
    } else {
        List[File]()
    }
}

However, this produces an empty list.

I've also thought of

var all_files: List[Any] = List(dbutils.fs.ls("s3://bucket"))

but this produces a single-element list wrapping the whole array:

all_files: List[Any] = List(WrappedArray(FileInfo(s3://bucket/.internal_name.pl.swp, .internal_name.pl.swp, 12288), FileInfo(s3://bucket/file0, 10223616), FileInfo(s3://bucket/, file1, 0), ....)

which has a length of 1, so I cannot iterate over the individual files. I also cannot turn it into a dataframe, as suggested by "How to iterate scala wrappedArray? (Spark)", so this isn't usable.

How can I generate a list of files in Scala, and then iterate through them?


Answer 1:


You should do:

val name: String = ???   // your regex pattern
val all_files: Seq[String] =
  dbutils.fs.ls("s3://bucket").map(_.path).filter(_.matches(name))
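Since `dbutils` is only available on a Databricks cluster, here is a minimal local sketch of the same map-then-filter idea; the `FileInfo` case class below is a hypothetical stand-in for the values `dbutils.fs.ls` returns:

```scala
// Hypothetical stand-in for the FileInfo values returned by dbutils.fs.ls
case class FileInfo(path: String, name: String, size: Long)

val listing: Seq[FileInfo] = Seq(
  FileInfo("s3://bucket/data_2020.csv", "data_2020.csv", 10223616L),
  FileInfo("s3://bucket/readme.txt", "readme.txt", 12288L)
)

// Keep only the paths whose full string matches the regex
val pattern = ".*\\.csv$"
val matching: Seq[String] = listing.map(_.path).filter(_.matches(pattern))

matching.foreach(println)  // iterate over the filtered paths
```

One thing to watch: `String.matches` tests the regex against the entire string, unlike Python's `re.match`, which only anchors at the start, so the pattern may need leading `.*`.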


Source: https://stackoverflow.com/questions/52650777/scala-databricks-getting-a-list-of-files
