I have a lot of zip files that need to be processed by a C++ library. So I use C++ to write my hadoop streaming program. The program will read a zip file, unzip it, and process the extracted data. My problem is that:
- my mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper but at least one of the file is partial. You know zip files can't be processed like this. Can I get exactly one file per map? I don't want to use file list as input and read it from my program because I want to have the advantage of data locality. 
- I can accept the contents of multiple zip file per map if Hadoop don't split the zip files. I mean exactly 1, 2, 3 files, not something like 2.3 files. Actually it will be even better because my program need to load about 800MB data file for processing the unziped data. Can we do this? 
You can find the solution here:
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work then you would need to implement an InputFormat which is not very difficult to do and you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Rather then depending on the min split size I would suggest an easier way is to Gzip your files.
There is a way to compress files using gzip
If you are on Linux you compress the extracted data with
gzip -r /path/to/data
Now that you have this pass this data as your input in your hadoop streaming job.
来源:https://stackoverflow.com/questions/14027594/hadoop-streaming-single-file-or-multi-file-per-map-dont-split