Processing a large set of small files with Hadoop


失恋的感觉 2021-01-01 00:17

I am using the Hadoop example program WordCount to process a large set of small files/web pages (approx. 2-3 kB each). Since this is far from the optimal file size for Hadoop, the

5 answers

天命终不由人 (OP)

2021-01-01 00:36

From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you run an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input. You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
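To get a rough feel for why packing helps, here is a back-of-the-envelope sketch in plain Java (no Hadoop dependency). The file count (1,000,000) and the 128 MB split size are assumptions chosen for illustration, not figures from the question; only the ~3 kB page size comes from it. The estimate also ignores SequenceFile overhead (keys, record headers, sync markers), so real numbers would differ somewhat.

```java
public class SmallFilesSplitEstimate {

    /** Number of input splits needed to cover totalBytes, rounding up. */
    static long splitCount(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long fileCount = 1_000_000L;           // hypothetical number of web pages
        long fileBytes = 3L * 1024;            // ~3 kB per page, as in the question
        long splitBytes = 128L * 1024 * 1024;  // assumed split size (128 MB)

        // Unpacked: every small file yields at least one split, hence one map task each.
        long tasksUnpacked = fileCount;

        // Packed: the splits cover the combined payload instead of individual files.
        long totalBytes = fileCount * fileBytes;
        long tasksPacked = splitCount(totalBytes, splitBytes);

        // Roughly how many pages a single mapper sees per split (overhead ignored).
        long filesPerSplit = splitBytes / fileBytes;

        System.out.println("map tasks, one file each: " + tasksUnpacked);
        System.out.println("map tasks, packed:        " + tasksPacked);
        System.out.println("pages per split (approx): " + filesPerSplit);
    }
}
```

Under these assumptions the job drops from one million map tasks to 23, with each mapper handling on the order of 43,000 pages, which is the effect the answer describes as "each mapper will process many files".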

Also see: Providing several non-textual files to a single map in Hadoop MapReduce
