Spark : Read file only if the path exists

無奈伤痛 2020-12-16 12:45

I am trying to read the files present at a sequence of paths in Scala. Below is a sample (pseudo) code:

    val paths: Seq[String] = ... // sequence of paths
    val dataframe = spark.read.parquet(paths: _*)

Some of these paths may not exist. Is there a way to read only the paths that do exist, without the job failing on the missing ones?
2 Answers
  •  别那么骄傲
    2020-12-16 13:07

    You can filter out the irrelevant files as in @Psidom's answer. In Spark, the best way to do so is to use Spark's internal Hadoop configuration. Given that the Spark session variable is called `spark`, you can do:

    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.fs.Path
    
    val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    
    // Returns true only if the path exists and is a directory
    def testDirExist(path: String): Boolean = {
      val p = new Path(path)
      hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
    }
    val filteredPaths = paths.filter(p => testDirExist(p))
    val dataframe = spark.read.parquet(filteredPaths: _*)
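
    The snippet above targets paths resolved through Hadoop's FileSystem (e.g. HDFS). If the paths happen to be on the local filesystem, the same existence check can be sketched with just the JDK, no Hadoop dependency; a minimal sketch, where `existingDirs` and the sample paths are hypothetical names introduced here for illustration:

    ```scala
    import java.nio.file.{Files, Paths}

    // Keep only the path strings that exist as directories on the local
    // filesystem. Mirrors the Hadoop-based check above, for local paths only.
    def existingDirs(paths: Seq[String]): Seq[String] =
      paths.filter(p => Files.isDirectory(Paths.get(p)))

    // Example: the JVM temp dir exists; the second path (hypothetical) does not.
    val candidates = Seq(System.getProperty("java.io.tmpdir"), "/no/such/dir")
    val kept = existingDirs(candidates)
    ```

    The filtered sequence can then be passed to `spark.read.parquet(kept: _*)` in the same way as `filteredPaths` above.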
    
