How to limit the size of a Hadoop Sequence file?

Submitted by 浪尽此生 on 2019-12-12 02:20:41

Question


I am writing a Hadoop sequence file using a txt file as input. I know how to write a Sequence file from a text file.

But I want to limit the output sequence file to a specific size, say 256MB.

Is there any inbuilt method to do this?


Answer 1:


AFAIK you'll need to write your own custom output format to limit output file sizes - by default FileOutputFormats create a single output file per reducer.

Another option is to create your sequence files as normal, then run a second, map-only job with identity mappers, amending the minimum / maximum input split size to ensure that each mapper processes only ~256MB. This means an input file of 1GB would be processed by 4 mappers and create output files of ~256MB each. You will get smaller files where an input file is, say, 300MB (a 256MB mapper and a 44MB mapper will run).
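A minimal sketch of that second, map-only job, assuming the classic `org.apache.hadoop.mapred` API (which matches the `mapred.*` property names below); the class name, paths, and Text key/value types are illustrative - match them to your actual sequence files:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

// Hypothetical driver class: re-writes existing sequence files in ~256MB chunks.
public class ResizeSeqFiles {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ResizeSeqFiles.class);
        conf.setJobName("resize-seq-files");

        // Force each input split (and so each mapper) to cover ~256MB.
        conf.setLong("mapred.min.split.size", 268435456L);
        conf.setLong("mapred.max.split.size", 268435456L);

        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setMapperClass(IdentityMapper.class); // pass records through unchanged
        conf.setNumReduceTasks(0);                 // map-only: one output file per mapper

        // Assumed key/value classes - use whatever your sequence files contain.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

Since each mapper writes its own output file, the split size directly bounds the output file size (the output can be slightly smaller or larger once compression is taken into account).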

The properties you are looking for are:

  • mapred.min.split.size
  • mapred.max.split.size

They are both configured in bytes, so set them both to 268435456 (256 × 1024 × 1024).
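If your driver parses generic options (e.g. via ToolRunner), the same properties can be set on the command line without touching code; the jar and class names here are placeholders:

```shell
# 268435456 bytes = 256 * 1024 * 1024 = 256MB per split
hadoop jar myjob.jar MyDriver \
  -D mapred.min.split.size=268435456 \
  -D mapred.max.split.size=268435456 \
  /input/seqfiles /output/resized
```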



Source: https://stackoverflow.com/questions/15610116/how-to-limit-size-of-hadoop-sequence-file
