Zip support in Apache Spark

后端 未结 5 2000
时光取名叫无心
时光取名叫无心 2020-12-03 15:02

I have read about Spark\'s support for gzip-kind input files here, and I wonder if the same support exists for different kind of compressed files, such as

5条回答
  •  谎友^
    谎友^ (楼主)
    2020-12-03 15:27

    Since Apache Spark uses Hadoop Input formats we can look at the hadoop documentation on how to process zip files and see if there is something that works.

    This site gives us an idea of how to use this (namely we can use the ZipFileInputFormat). That being said, since zip files are not split-table (see this) your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.

    This question is similar to this other question, however it adds an additional question of if it would be possible to have a single zip file (which, since it isn't a split-table format isn't a good idea).

提交回复
热议问题