mapreduce in java - gzip input files

Submitted by 心不动则不痛 on 2020-01-23 10:40:28

Question


I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple .gz files.

I've been looking all over, but all the tutorials I've found explain how to process a plain text file; I haven't found anything that solves my problem.

I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.

Any help would be appreciated.


Answer 1:


Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.

So all you have to do is write the logic as you would for a text file and pass the directory containing the .gz files in as the input path.
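For illustration, here is a minimal sketch of such a job in the standard word-count shape (the class and argument names are my own, not from the question). Because the input files end in .gz, the default TextInputFormat wraps the input stream with GzipCodec automatically, so the mapper already sees decompressed lines:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzWordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // "value" is already a plain, decompressed line of text;
            // no gzip handling is needed here.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gz word count");
        job.setJarByClass(GzWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Point the job at the directory holding the .gz files;
        // no extra decompression step is required.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run it with something like: hadoop jar myjob.jar GzWordCount /path/to/gz-dir /path/to/output (paths here are placeholders).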

But the issue with gzip files is that they are not splittable. Imagine you have gzip files of 5 GB each: each mapper will then process an entire 5 GB file instead of working block by block at the default block size, so you lose parallelism within each file.



Source: https://stackoverflow.com/questions/26576985/mapreduce-in-java-gzip-input-files
