Change File Split Size in Hadoop

陌清茗 2020-12-01 00:56

I have a bunch of small files in an HDFS directory. Although the total volume of the files is relatively small, the processing time per file is huge. How can I change the split size so that each mapper handles less data and more map tasks run in parallel?

4 Answers
  •  爱一瞬间的悲伤 2020-12-01 01:41

    Here is a fragment that illustrates the correct way to do this without magic configuration strings: the constant you need is already defined inside FileInputFormat. The block size could be read from the default HDFS block-size constant, but there is a good chance it has been overridden by the user, so a local default is used here instead.

    Here I simply halve the maximum split size, falling back to a 128 MB default if it was never set.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
    // ....
    
    // Fallback used when mapreduce.input.fileinputformat.split.maxsize is unset.
    final long DEFAULT_SPLIT_SIZE = 128L * 1024 * 1024;
    final Configuration conf = new Configuration(); // or job.getConfiguration()
    
    // Halve the maximum input split size so each mapper reads less data.
    conf.setLong(
        FileInputFormat.SPLIT_MAXSIZE,
        conf.getLong(FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);
    
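    If you already have a Job object in your driver, the same setting can also be applied through FileInputFormat's helper method instead of writing to the Configuration directly. A minimal sketch of such a driver follows; the job name and input path are placeholders, not values from the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
    // Sketch of a driver; "wordcount" and "/user/input" are placeholders.
    final Configuration conf = new Configuration();
    final Job job = Job.getInstance(conf, "wordcount");
    
    FileInputFormat.addInputPath(job, new Path("/user/input"));
    
    // Cap each split at 64 MB, half of the usual 128 MB default,
    // so roughly twice as many map tasks are launched per input.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    
    Under the hood, setMaxInputSplitSize writes the same mapreduce.input.fileinputformat.split.maxsize property on the job's Configuration, so the two approaches are equivalent; the helper just spares you the property lookup.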
