I have read about Spark's support for gzip-compressed input files here, and I wonder whether the same support exists for other kinds of compressed files, such as
You can use sc.binaryFiles to read each zip file as a binary stream:
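import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
// map, not flatMap: a ZipInputStream is a single value, not a collection
val rdd = sc.binaryFiles(path).map {
  case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
} //=> RDD[ZipInputStream]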
You can then map a ZipInputStream to a list of lines:
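import java.io.{BufferedReader, InputStreamReader}
val zis = rdd.first
val entry = zis.getNextEntry // positions the stream at the first entry
// Read from zis itself; it yields the bytes of the current entry
val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList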
The problem remains, however, that a zip file is not splittable, so Spark cannot divide a single archive across several tasks.
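The snippet above reads only the first entry of the first archive, and it does so on the driver. Since ZipInputStream is not Serializable, pulling it out of the RDD with rdd.first may well fail. One possible alternative is to do the whole read inside the transformation, so that only plain strings leave the executor. The sketch below is an illustration rather than a Spark API: `ZipLines` and `readZipLines` are hypothetical names, and it assumes UTF-8 text entries that fit in memory.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream
import scala.collection.mutable.ListBuffer

object ZipLines {
  // Hypothetical helper (not part of Spark): reads every file entry of a zip
  // stream and returns (entryName, line) pairs, then closes the stream.
  def readZipLines(in: InputStream): List[(String, String)] = {
    val zis = new ZipInputStream(in)
    try {
      val out = ListBuffer.empty[(String, String)]
      var entry = zis.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // A fresh reader per entry: ZipInputStream signals end-of-stream at
          // the end of the current entry, so the reader cannot run past it.
          // The reader is deliberately not closed, since that would close zis.
          val br = new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8))
          var line = br.readLine()
          while (line != null) {
            out += ((entry.getName, line))
            line = br.readLine()
          }
        }
        entry = zis.getNextEntry
      }
      out.toList
    } finally zis.close()
  }
}

// Possible use with Spark, assuming a SparkContext `sc` and an input `path`:
// val lines = sc.binaryFiles(path).flatMap { case (_, content) =>
//   ZipLines.readZipLines(content.open())
// } //=> RDD[(String, String)]
```

Each archive is still processed by a single task, and this version materializes all of an archive's lines at once, so it suits many moderately sized zip files better than one very large one. If later stages need more parallelism, repartitioning the resulting RDD after expansion is one option.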