Zip support in Apache Spark

时光取名叫无心 2020-12-03 15:02

I have read about Spark's support for gzip-compressed input files here, and I wonder whether the same support exists for other kinds of compressed files, such as

5 Answers
  •  离开以前
    2020-12-03 15:25

    You can use sc.binaryFiles to read each zip archive as a binary file:

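    import java.util.zip.ZipInputStream
    import org.apache.spark.input.PortableDataStream

    // map, not flatMap: each file yields exactly one ZipInputStream
    val rdd = sc.binaryFiles(path).map {
        case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
    }  //=> RDD[ZipInputStream]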
    

    You can then turn a ZipInputStream into a list of lines:

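    import java.io.{BufferedReader, InputStreamReader}

    // Caveat: rdd.first sends the element to the driver, and ZipInputStream is not
    // Serializable, so this step may need to run inside a transformation instead.
    val zis = rdd.first
    val entry = zis.getNextEntry  // position the stream at the first entry
    val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
    val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList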
    

    The problem remains that a zip file is not splittable, so each archive is read in full by a single task.
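
As a rough sketch of how the two steps could be combined, the helper below reads every entry of a zip stream and returns (entry name, line) pairs. The name zipLines is hypothetical and not part of Spark or of the answer above. It assumes the entries are UTF-8 text and that one archive fits in the memory of a single task. Because the stream is opened and consumed in one call, no ZipInputStream has to leave the task that created it.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.util.zip.ZipInputStream

// Hypothetical helper (an illustration, not a Spark API):
// returns one (entryName, line) pair per text line in the archive.
def zipLines(in: InputStream): List[(String, String)] = {
  val zis = new ZipInputStream(in)
  try {
    Iterator
      .continually(zis.getNextEntry)   // advance to the next entry
      .takeWhile(_ != null)            // null means no more entries
      .filterNot(_.isDirectory)        // skip directory entries
      .flatMap { entry =>
        // A fresh reader per entry; the zip stream reports end-of-stream
        // at the end of each entry, so the reader stops there.
        val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
        Iterator
          .continually(br.readLine())
          .takeWhile(_ != null)
          .map(line => (entry.getName, line))
      }
      .toList                          // materialize before the stream is closed
  } finally zis.close()
}

// Possible use with Spark, assuming an existing SparkContext `sc` and a `path`:
// val lines = sc.binaryFiles(path).flatMap { case (_, content) => zipLines(content.open()) }
```

Each archive is still handled by one task, so this does not make zip input splittable; parallelism comes only from having many archives under path.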
