Processing a large set of small files with Hadoop

失恋的感觉 2021-01-01 00:17

I am using the Hadoop example program WordCount to process a large set of small files/web pages (approx. 2-3 kB each). Since this is far from the optimal file size for Hadoop, the

5 Answers
  •  萌比男神i
    2021-01-01 00:50

    CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task). Overall MapReduce processing time also falls, since fewer mappers run. Where no archive-aware InputFormat is available, using CombineFileInputFormat will improve performance.
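
    A minimal driver sketch of this approach, assuming the Hadoop 2.x MapReduce API and its stock CombineTextInputFormat (the text-oriented subclass of CombineFileInputFormat). The mapper and reducer mirror the standard WordCount example; the 128 MB split cap is an illustrative value you would tune toward your HDFS block size:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineWordCount {

    // Same tokenizing mapper as the stock WordCount example; the input key
    // is LongWritable because CombineTextInputFormat emits (offset, line).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine word count");
        job.setJarByClass(CombineWordCount.class);

        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (illustrative; tune to block size).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

    The split-size cap is what keeps one mapper from swallowing the entire input: with thousands of 2-3 kB files, each combined split gathers files up to the cap, so the mapper count scales with total data volume rather than file count.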
