Processing large set of small files with Hadoop

Backend · Open · 5 answers · 943 views
失恋的感觉 2021-01-01 00:17

I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2–3 kB each). Since this is far from the optimal file size for Hadoop files, the

5 Answers
  •  天命终不由人
    2021-01-01 00:36

    From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values, and possibly the URLs as keys. If you run a MapReduce job over the SequenceFile(s), each mapper will process many files (depending on the split size), and each file will be presented to the map function as a single input. You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
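    The packing idea above can be sketched with only the JDK: write many (URL, HTML) pairs as length-prefixed records into one container file, then read them back one record at a time, which is roughly how a mapper would see them. The class and file names here are hypothetical; real Hadoop code would use `SequenceFile.Writer` to create the file and `SequenceFileAsTextInputFormat` to consume it, rather than this hand-rolled format.

    ```java
    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Sketch (not Hadoop's actual SequenceFile format): pack many small
    // (url, html) pairs into one file so a single reader can stream them.
    public class TinyPack {
        // Append each key/value pair as UTF-length-prefixed records.
        static void write(Path out, Map<String, String> pages) throws IOException {
            try (DataOutputStream o = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(out)))) {
                for (Map.Entry<String, String> e : pages.entrySet()) {
                    o.writeUTF(e.getKey());   // key: the page URL
                    o.writeUTF(e.getValue()); // value: the page contents
                }
            }
        }

        // Stream the records back, one (url, html) pair at a time.
        static Map<String, String> read(Path in) throws IOException {
            Map<String, String> pages = new LinkedHashMap<>();
            try (DataInputStream i = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(in)))) {
                while (i.available() > 0) {
                    pages.put(i.readUTF(), i.readUTF());
                }
            }
            return pages;
        }

        public static void main(String[] args) throws IOException {
            Path f = Files.createTempFile("pages", ".seq");
            Map<String, String> pages = new LinkedHashMap<>();
            pages.put("http://example.com/a", "<html>a</html>");
            pages.put("http://example.com/b", "<html>b</html>");
            write(f, pages);
            System.out.println(read(f).equals(pages)); // prints true
            Files.delete(f);
        }
    }
    ```

    The point of the container is that one HDFS block then holds thousands of pages, so the job runs a handful of mappers instead of one mapper per tiny file.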

    Also see: Providing several non-textual files to a single map in Hadoop MapReduce
