Zip support in Apache Spark

时光取名叫无心 2020-12-03 15:02

I have read about Spark's support for gzip-type input files here, and I wonder whether the same support exists for other kinds of compressed files, such as

5 answers
  •  执笔经年
    2020-12-03 15:29

    Below is an example that searches a directory for .zip files and creates an RDD for each one, using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes the extracted files to an output directory.

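    import com.cotdp.hadoop.ZipFileInputFormat
    import org.apache.hadoop.io.{BytesWritable, Text}

    // allzip, hadoopConf and ProcessFile are not shown in this excerpt; see the linked Unzip.scala.
    allzip.foreach { x =>
      // Build one RDD per archive; each record pairs a Text key with a BytesWritable value.
      val zipFileRDD = sc.newAPIHadoopFile(
        x.getPath.toString,
        classOf[ZipFileInputFormat],
        classOf[Text],
        classOf[BytesWritable],
        hadoopConf)

      // Hand every record's key (as a String) and its bytes to ProcessFile.
      zipFileRDD.foreach { y =>
        ProcessFile(y._1.toString, y._2)
      }
    }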

    https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala

    The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop
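
    To make the shape of those records concrete, here is a minimal plain-JVM sketch with no Spark or Hadoop dependency. It assumes the input format produces one (name, bytes) record per archive entry, which is what the ProcessFile(y._1.toString, y._2) call suggests. The ZipEntries object and its readEntries and makeZip helpers are hypothetical names made up for this illustration; they are not part of Spark or of the linked repositories.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.collection.mutable.ListBuffer

// Hypothetical helper object, for illustration only.
object ZipEntries {

  // Expand an in-memory zip archive into (entry name, entry bytes) pairs,
  // skipping directory entries.
  def readEntries(zipBytes: Array[Byte]): List[(String, Array[Byte])] = {
    val zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))
    val entries = ListBuffer.empty[(String, Array[Byte])]
    try {
      var entry = zin.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // Copy the current entry's content into a byte array.
          val out = new ByteArrayOutputStream()
          val buf = new Array[Byte](8192)
          var n = zin.read(buf)
          while (n != -1) {
            out.write(buf, 0, n)
            n = zin.read(buf)
          }
          entries += ((entry.getName, out.toByteArray))
        }
        entry = zin.getNextEntry
      }
    } finally {
      zin.close()
    }
    entries.toList
  }

  // Build a small zip archive in memory so the sketch can be tried without files.
  def makeZip(files: Seq[(String, String)]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val zout = new ZipOutputStream(bos)
    files.foreach { case (name, content) =>
      zout.putNextEntry(new ZipEntry(name))
      zout.write(content.getBytes("UTF-8"))
      zout.closeEntry()
    }
    zout.close()
    bos.toByteArray
  }
}
```

    For example, ZipEntries.readEntries(ZipEntries.makeZip(Seq("a.txt" -> "hello"))) returns a single pair whose name is "a.txt". Because each entry is buffered fully in memory, this approach only suits archives whose individual files are reasonably small.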
