Hadoop gzip input file using only one mapper [duplicate]


Question


Possible Duplicate:
Why can't hadoop split up a large text file and then compress the splits using gzip?

I found that when the input file is gzipped, Hadoop allocates only one map task to handle my map/reduce job.

The gzipped file is more than 1.4 GB, so I would expect many mappers to run in parallel (exactly as when using an uncompressed file).

Is there any configuration I can set to improve this?


Answer 1:


Gzip files can't be split, so all of the data is processed by a single map task. To have the data processed by multiple maps, you have to use a compression format whose files can be split. Here is a nice article on it. (1)
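As a concrete illustration (not part of the original answer), here is a minimal sketch of a map-only job using the newer org.apache.hadoop.mapreduce API. It assumes a Hadoop version whose BZip2Codec implements SplittableCompressionCodec (0.21+ / Hadoop 2.x), in which case TextInputFormat will split a large .bz2 input across several map tasks, whereas an equivalent .gz input would still go to a single mapper. The paths and job name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableInputExample {

    // Identity-style mapper: passes each input line through unchanged.
    public static class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "splittable-bzip2-input");
        job.setJarByClass(SplittableInputExample.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                  // map-only job for the demo
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // /data/big.bz2 is splittable: expect multiple map tasks.
        // /data/big.gz would not be: the whole file goes to one mapper.
        FileInputFormat.addInputPath(job, new Path("/data/big.bz2"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

No special configuration is needed in the driver; the number of map tasks simply follows the input splits, so the choice of compression codec on the input is what decides whether the job parallelizes.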

Edit: Here is another article on Snappy (2), a compression library from Google.

(1) http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

(2) http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/



Source: https://stackoverflow.com/questions/7388436/hadoop-gzip-input-file-using-only-one-mapper
