Change File Split size in Hadoop

陌清茗 2020-12-01 00:56

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge.

4 Answers
  •  独厮守ぢ
    2020-12-01 01:30

    Hadoop: The Definitive Guide, page 203: "The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula:

    max(minimumSize, min(maximumSize, blockSize))
    

    by default

    minimumSize < blockSize < maximumSize
    

    so the split size is blockSize."

    For example,

    Minimum split size: 1 byte
    Maximum split size: 32 MB
    Block size:         64 MB
    Split size:         32 MB  (= max(1, min(32 MB, 64 MB)))
    
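    To force smaller splits in practice, the maximum split size is lowered below the block size on the job. Below is a minimal sketch using the Hadoop 2.x MapReduce API (the class name, job name, and input-path argument are placeholders; the FileInputFormat helpers and configuration property names are the standard ones):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        public class SmallSplitJob {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Equivalent property form (e.g. for -D on the command line):
                //   mapreduce.input.fileinputformat.split.maxsize = 33554432
                //   mapreduce.input.fileinputformat.split.minsize = 1

                Job job = Job.getInstance(conf, "small-split-job");

                // Cap splits at 32 MB: since 32 MB < 64 MB block size,
                // max(minimumSize, min(maximumSize, blockSize)) = 32 MB.
                FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
                FileInputFormat.setMinInputSplitSize(job, 1L);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                // ... set mapper, reducer, output path, etc., then submit:
                // System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }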

    Hadoop works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map task each, and the job can be tens or hundreds of times slower than the equivalent job with a single input file and 16 map tasks.
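    One common mitigation for exactly this many-small-files case, which the quoted passage does not cover, is CombineTextInputFormat: it packs several small files into each split. A minimal sketch, again against the Hadoop 2.x API (class and job names are placeholders):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

        public class CombineSmallFilesJob {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "combine-small-files");

                // One split may now span many small files instead of one file each.
                job.setInputFormatClass(CombineTextInputFormat.class);

                // Allow up to 64 MB per split, so ~10,000 x 100 KB files collapse
                // into roughly 16 splits (and 16 map tasks) instead of 10,000.
                CombineTextInputFormat.setMaxSplitSize(job, 64L * 1024 * 1024);

                CombineTextInputFormat.addInputPath(job, new Path(args[0]));
                // ... set mapper, reducer, output path, etc., then submit the job.
            }
        }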

