Spark: Read file only if the path exists

Asked by 無奈伤痛 on 2020-12-16 12:45

I am trying to read the files at a sequence of paths in Scala. Below is sample (pseudo) code:

val paths: Seq[String] = ... // Seq of paths, some of which may not exist
val dataframe = spark.read.parquet(paths: _*)

Some of the paths in the sequence may not exist. Is there a way to skip the missing paths when reading, rather than having the whole read fail?

2 Answers
  • Psidom 2020-12-16 13:01

    How about filtering the paths first:

    paths.filter(f => new java.io.File(f).exists)
    

    For instance:

    Seq("/tmp", "xx").filter(f => new java.io.File(f).exists)
    // res18: List[String] = List(/tmp)
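
    A minimal end-to-end sketch of this approach, assuming a SparkSession named spark and parquet files on the driver's local filesystem (java.io.File only checks the local disk, not HDFS or S3); existingPaths is an illustrative name:

    // keep only the paths that exist locally, then read them together
    val existingPaths = paths.filter(f => new java.io.File(f).exists)
    val dataframe = spark.read.parquet(existingPaths: _*)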
    
  • 2020-12-16 13:07

    You can filter out the nonexistent paths as in @Psidom's answer, but java.io.File only sees the driver's local filesystem. In Spark, where paths usually live on a distributed filesystem such as HDFS or S3, the best way is to check them through Spark's internal Hadoop configuration. Given that the Spark session variable is called "spark", you can do:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // the filesystem configured for this Spark session (HDFS, S3, local, ...)
    val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // true only if the path exists and is a directory
    def testDirExists(path: String): Boolean = {
      val p = new Path(path)
      hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
    }

    val filteredPaths = paths.filter(p => testDirExists(p))
    val dataframe = spark.read.parquet(filteredPaths: _*)
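
    If the paths can point at different filesystems (say, some hdfs:// and some s3a://), FileSystem.get with the default configuration resolves only the default filesystem. A variant sketch, using Hadoop's standard Path.getFileSystem API to resolve the filesystem from each path's own scheme (pathExists and readablePaths are hypothetical names, not part of the answer above):

    import org.apache.hadoop.fs.Path

    def pathExists(path: String): Boolean = {
      val p = new Path(path)
      // resolve the filesystem from the path's scheme (hdfs://, s3a://, file://, ...)
      val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
      fs.exists(p)
    }

    val readablePaths = paths.filter(pathExists)
    val dataframe = spark.read.parquet(readablePaths: _*)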
    