How to read a zip containing multiple files in Apache Spark

星月不相逢 2020-12-06 18:41

I have a zipped file containing multiple text files. I want to read each file and build a list of RDDs, each containing the content of one file.

val         


        
5 answers
  •  独厮守ぢ
    2020-12-06 19:13

    Here's a working version of @Atais's solution (which needed improving by closing the streams):

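    import java.util.zip.ZipInputStream
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      // Returns one String per zip entry for zip paths, otherwise one String per line.
      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        // Check the extension rather than any occurrence of "zip" in the path.
        if (path.toLowerCase.endsWith(".zip")) {

          sc.binaryFiles(path, minPartitions)
            .flatMap {
              case (zipFilePath, zipContent) =>
                val zipInputStream = new ZipInputStream(zipContent.open())
                Stream.continually(zipInputStream.getNextEntry)
                  .takeWhile(_ != null)
                  .map { _ =>
                    // Reading stops at the end of the current entry.
                    scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
                  } #::: { zipInputStream.close(); Stream.empty[String] } // close once all entries are consumed
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }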
    

    Then, to read a zip file, all you have to do is:

    sc.readFile(path)
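
Since the question asks for the content of each file separately, it may help to keep the entry names as well. The following is only a minimal sketch of the per-entry extraction, written without any Spark dependency so it can be tried locally; `ZipEntries` and `readAll` are hypothetical names that are not part of Spark or of the answer above, and the sketch assumes UTF-8 text entries that each fit in memory.

```scala
import java.io.InputStream
import java.util.zip.ZipInputStream

object ZipEntries {
  // Hypothetical helper: eagerly reads every file entry of a zip stream and
  // returns (entryName, content) pairs. The stream is closed in all cases.
  def readAll(in: InputStream): List[(String, String)] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator
        .continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .filterNot(_.isDirectory)
        .map { entry =>
          // Reading from zis stops at the end of the current entry.
          val text = scala.io.Source.fromInputStream(zis, "UTF-8").mkString
          (entry.getName, text)
        }
        .toList // force evaluation before the stream is closed
    } finally zis.close()
  }
}
```

Inside Spark this could presumably be plugged in as `sc.binaryFiles(path).flatMap { case (_, content) => ZipEntries.readAll(content.open()) }`, which would give an `RDD[(String, String)]` keyed by entry name instead of a plain `RDD[String]`.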
    
