Zip support in Apache Spark

时光取名叫无心 2020-12-03 15:02

I have read about Spark's support for gzip-kind input files here, and I wonder if the same support exists for other kinds of compressed files, such as

5 answers

执笔经年 (original poster)

2020-12-03 15:29
Below is an example that searches a directory for .zip files and creates an RDD from each one, using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes the extracted files to an output directory.
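```
import org.apache.hadoop.io.{BytesWritable, Text}
import com.cotdp.hadoop.ZipFileInputFormat

// Assumed to be defined earlier (see the full source linked below):
// sc (SparkContext), hadoopConf (Hadoop Configuration),
// allzip (the .zip files found, each exposing getPath) and ProcessFile.
allzip.foreach { x =>
  // Key = file name inside the zip, value = that file's bytes.
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable],
    hadoopConf)

  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}
```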
https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala

The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop
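
To make the (file name, bytes) record shape used above more concrete, here is a minimal sketch that expands a zip archive held in memory with plain java.util.zip. It is not the ZipFileInputFormat implementation, just an illustration of the same idea; `ZipRecords`, `unzipEntries` and `zipOf` are hypothetical names made up for this example.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.collection.mutable.ListBuffer

// Hypothetical helper object, not part of Spark or of the linked repositories.
object ZipRecords {

  /** Expands an in-memory zip archive into (entry name, uncompressed bytes) pairs. */
  def unzipEntries(archive: Array[Byte]): List[(String, Array[Byte])] = {
    val zis = new ZipInputStream(new ByteArrayInputStream(archive))
    val records = ListBuffer.empty[(String, Array[Byte])]
    val buffer = new Array[Byte](8192)
    try {
      var entry: ZipEntry = zis.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // Copy the current entry's bytes until the stream reports end of entry.
          val out = new ByteArrayOutputStream()
          var n = zis.read(buffer)
          while (n != -1) {
            out.write(buffer, 0, n)
            n = zis.read(buffer)
          }
          records += ((entry.getName, out.toByteArray))
        }
        entry = zis.getNextEntry
      }
    } finally zis.close()
    records.toList
  }

  /** Builds a small archive in memory so the sketch can be tried without any files. */
  def zipOf(files: Seq[(String, String)]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bos)
    files.foreach { case (name, text) =>
      zos.putNextEntry(new ZipEntry(name))
      zos.write(text.getBytes("UTF-8"))
      zos.closeEntry()
    }
    zos.close()
    bos.toByteArray
  }
}
```

If you would rather avoid a custom input format, a helper like this could probably be combined with `sc.binaryFiles`, along the lines of `sc.binaryFiles(dir).flatMap { case (_, stream) => ZipRecords.unzipEntries(stream.toArray()) }`. Note that this reads each whole archive into memory on one executor, so it only suits archives of moderate size.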