I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge.
Here is a fragment that illustrates the correct way to do what is needed here, without magic configuration strings: the required constant is defined inside FileInputFormat. The block size could be taken from the default HDFS block-size constant if needed, but there is a good chance it has been overridden by the user.
Here I simply halve the maximum split size if one was already defined.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// ....

// 128 MB, a common default HDFS block size.
final long DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024;
final Configuration conf = ...

// Halve the maximum split size so each mapper receives a smaller slice of input.
conf.setLong(
        FileInputFormat.SPLIT_MAXSIZE,
        conf.getLong(FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);
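
For completeness, here is a minimal driver sketch showing where such a setting might sit. The class name, job name, and input/output path arguments are placeholders I introduced for illustration, not part of the original answer; the key point is that the property must be set on the Configuration before the Job is created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitDriver {                                  // hypothetical class name
    private static final long DEFAULT_SPLIT_SIZE = 128L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Halve the maximum split size before the Job copies the configuration.
        conf.setLong(
                FileInputFormat.SPLIT_MAXSIZE,
                conf.getLong(FileInputFormat.SPLIT_MAXSIZE, DEFAULT_SPLIT_SIZE) / 2);

        Job job = Job.getInstance(conf, "small-split-job");      // placeholder job name
        job.setJarByClass(SmallSplitDriver.class);
        job.setInputFormatClass(TextInputFormat.class);          // any FileInputFormat subclass honors SPLIT_MAXSIZE

        FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Equivalently, the FileInputFormat.setMaxInputSplitSize(job, size) helper sets the same property on the job's configuration, if you prefer a method call over the constant.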