I am using Hadoop example program WordCount to process large set of small files/web pages (cca. 2-3 kB). Since this is far away from optimal file size for hadoop files, the
From my still limited understanding og Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you do a M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce