How does Hadoop perform input splits?


礼貌的吻别 2020-11-30 23:18

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. For the sake of simplicity, let's assume that each line is of

10 Answers

感动是毒 (OP)

2020-12-01 00:20

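The short answer is that the InputFormat takes care of splitting the file.

The way I approach this question is by looking at the default InputFormat, the TextInputFormat class:

File-based InputFormat classes such as TextInputFormat are subclasses of FileInputFormat, which takes care of the split. (Not every InputFormat is file-based, but the file-based ones share this logic.)

Specifically, FileInputFormat's getSplits method generates a List of InputSplit objects from the list of input files defined in the JobContext. The split is based on size in bytes, not on line count. The minimum and maximum split sizes can be set in the job configuration, for example in the project's XML configuration file, through the properties mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.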
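To make the byte-based splitting concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It mirrors the logic as I understand it from the FileInputFormat source: the split size is max(minSize, min(maxSize, blockSize)), and the last chunk is allowed to be up to 10% larger than the split size before it is cut again. The class name SplitSketch, the method splitFile, and the 128 MB block size are assumptions made for illustration; check the source of your Hadoop version for the exact behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Hadoop.
public class SplitSketch {
    // Slack factor: the final chunk may exceed splitSize by up to 10%.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs covering the whole file, in bytes.
    static List<long[]> splitFile(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left becomes the final (possibly shorter) split.
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 128 * mb;   // assumed HDFS block size
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        for (long[] s : splitFile(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note that the splits are cut at byte offsets with no regard for line boundaries. Handling a line that straddles two splits is the job of the record reader, not of getSplits.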
