I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge.
Here is a fragment that illustrates the correct way to do what is needed here, without magic configuration strings: the required constant is defined inside FileInputFormat. The block size could be taken from the default HDFS block-size constant if needed, but there is a good chance it has been overridden by the user.
Here I simply halve the maximum split size if one was already defined.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// ....

// 128 MB, a common default HDFS block size.
final long DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024;
final Configuration conf = ...

// Halve the maximum split size so each mapper receives a smaller slice of input.
conf.setLong(
        FileInputFormat.SPLIT_MAXSIZE,
        conf.getLong(FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);
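
For completeness, here is a minimal driver sketch showing where such a setting might sit. The class name, job name, and input/output path arguments are placeholders I introduced for illustration, not part of the original answer; the key point is that the property must be set on the Configuration before the Job is created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitDriver {                                  // hypothetical class name
    private static final long DEFAULT_SPLIT_SIZE = 128L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Halve the maximum split size before the Job copies the configuration.
        conf.setLong(
                FileInputFormat.SPLIT_MAXSIZE,
                conf.getLong(FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);

        Job job = Job.getInstance(conf, "small-split-job");      // placeholder job name
        job.setJarByClass(SmallSplitDriver.class);
        job.setInputFormatClass(TextInputFormat.class);          // any FileInputFormat subclass honors SPLIT_MAXSIZE

        FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Equivalently, the FileInputFormat.setMaxInputSplitSize(job, size) helper sets the same property on the job's configuration, if you prefer a method call over the constant.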