How does Hadoop perform input splits?

礼貌的吻别 2020-11-30 23:18

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of

10 Answers
  •  感动是毒
    2020-12-01 00:20

    The short answer is that the InputFormat takes care of splitting the file.

    I approach this question by looking at the default input format, the TextInputFormat class:

    File-based InputFormat classes such as TextInputFormat are subclasses of FileInputFormat, which takes care of the splitting.

    Specifically, FileInputFormat's getSplits function generates a List of InputSplit objects from the files defined in the JobContext. Splits are sized in bytes, and the minimum and maximum split sizes can be set in the project's XML configuration file (the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties).
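
To make the byte-based sizing concrete, here is a simplified standalone sketch of the arithmetic that getSplits performs. It is not Hadoop's source: the class name SplitSketch and the 300 MB example file are made up for illustration, and it ignores non-splittable (for example, compressed) files and block locations. It assumes the usual rule that the split size is the block size clamped between the configured minimum and maximum, and that the last split may run up to 10% over the target size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch of the split arithmetic; not Hadoop's actual code.
class SplitSketch {
    // The final split may be up to 10% larger than splitSize,
    // which avoids emitting a tiny trailing split.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size between the configured min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs (in bytes) covering the whole file.
    static List<long[]> getSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Whatever is left (at most 1.1 * splitSize) becomes the last split.
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 128 MB blocks, with min and max left effectively unconstrained.
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : getSplits(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With these numbers the 300 MB file yields three splits of 128 MB, 128 MB, and 44 MB. Note that the splits are cut purely by byte offset, without looking at line boundaries.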
