I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge. Th
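Write a custom input format that extends CombineFileInputFormat (this has its own pros and cons depending on the Hadoop distribution). It packs multiple small files into each input split, up to the size specified in mapred.max.split.size (named mapreduce.input.fileinputformat.split.maxsize in newer Hadoop releases).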
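To get a feel for how the maximum split size controls how many files end up in each mapper, here is a rough, self-contained sketch. It is only a simplified model and not Hadoop's actual algorithm, which works on blocks and also takes node and rack locality into account. The class name `SplitPackingSketch`, the `pack` method, and the file sizes are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of combining small files into splits.
// This is NOT Hadoop code; it only illustrates the size arithmetic.
public class SplitPackingSketch {

    // Greedily groups file sizes (in bytes) into combined splits, starting a
    // new split whenever adding the next file would exceed maxSplitSize.
    // A file larger than maxSplitSize gets a split of its own here; this
    // model does not break large files into chunks.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = {10 * mb, 20 * mb, 30 * mb, 40 * mb, 50 * mb};

        // Large limit: several files share a split, so fewer mappers run.
        System.out.println(pack(files, 64 * mb).size() + " splits"); // 3 splits

        // Tiny limit: every file ends up in its own split (one mapper each).
        System.out.println(pack(files, 1).size() + " splits"); // 5 splits
    }
}
```

Since the per-file processing time is large in your case, a smaller maximum split size keeps fewer files per mapper and preserves more parallelism; a larger one cuts task startup overhead but makes each mapper run longer.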