This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of
The short answer is that the InputFormat takes care of splitting the file.
The way I approach this question is by looking at the default TextInputFormat class.
All file-based InputFormat classes are subclasses of FileInputFormat, which takes care of the split.
Specifically, FileInputFormat's getSplits method generates a List of InputSplit objects from the list of input files defined in the JobContext. The split is based on byte size, and the minimum and maximum split sizes are not hard-coded: they can be set in the job's XML configuration (in Hadoop 2+ the properties are mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize).
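To make the size-based logic concrete, here is a minimal, self-contained sketch of it. The class and method names below are my own for illustration; only the split-size formula and the configuration property names mentioned above come from Hadoop's FileInputFormat, and real splits are InputSplit objects carrying block locations, not plain offset/length pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of FileInputFormat's size-based splitting.
public class SplitSketch {

    // Hadoop computes the effective split size as:
    //   max(minSize, min(maxSize, blockSize))
    // where minSize/maxSize correspond to the configurable
    // mapreduce.input.fileinputformat.split.{minsize,maxsize} properties.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Chop a file of fileLength bytes into (offset, length) pairs,
    // analogous to the List<InputSplit> that getSplits() returns.
    static List<long[]> getSplits(long fileLength, long blockSize,
                                  long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long length = Math.min(splitSize, remaining);
            splits.add(new long[] { fileLength - remaining, length });
            remaining -= length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Example: a 1 GB file with 128 MB HDFS blocks and the default
        // min/max split sizes yields eight 128 MB splits.
        long oneMB = 1024L * 1024L;
        for (long[] s : getSplits(1024 * oneMB, 128 * oneMB, 1, Long.MAX_VALUE)) {
            System.out.printf("split at offset %d, length %d%n", s[0], s[1]);
        }
    }
}
```

Note that the real getSplits also applies a small slop factor (about 10%) so the last chunk of a file isn't emitted as a tiny trailing split, and it records which hosts hold each block so the scheduler can place map tasks near the data; I've left both details out of the sketch.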