I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB each). Since this is far from the optimal file size for Hadoop, CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task). With fewer mappers running, the overall MapReduce processing time also falls. Since there is no archive-aware InputFormat, using CombineFileInputFormat should improve performance.
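For reference, here is a minimal driver sketch of how this might be wired up. It assumes a recent Hadoop mapreduce API where `CombineTextInputFormat` (a ready-made text subclass of `CombineFileInputFormat`) is available, and reuses the standard WordCount `TokenizerMapper`/`IntSumReducer` classes; the 128 MB split cap is an arbitrary example value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count (combined splits)");
        job.setJarByClass(WordCountDriver.class);

        // Pack many small files into each split; one split = one map task.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at ~128 MB (example value; sets
        // mapreduce.input.fileinputformat.split.maxsize).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);

        // Standard WordCount mapper/reducer from the Hadoop examples.
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The max split size is what controls how many small files get packed per mapper; without it, CombineFileInputFormat may combine everything on a node into one huge split. On older Hadoop versions that lack `CombineTextInputFormat`, you would instead have to subclass `CombineFileInputFormat` yourself and supply a record reader.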