Hadoop's input splitting - how does it work?

Asked by 长情又很酷 on 2021-01-16 20:15

I have a brief understanding of Hadoop.

I am curious to know how it works.

To be precise, I want to know how exactly it divides/splits the input file.

Does it

2 Answers
  •  情深已故
    2021-01-16 20:37

    This is dependent on the InputFormat, which for most file-based formats is defined in the FileInputFormat base class.

    There are a number of configurable options that control whether Hadoop will process a single file as one split or divide it into multiple splits:

    • If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't seek to an arbitrary point in the file and recover the compressed stream), while bzip2 is splittable. See the isSplitable() implementation for your specific InputFormat for more information.
    • If the file size is less than or equal to its defined HDFS block size, Hadoop will most probably process it as a single split (this is configurable; see the point below about the split size properties).
    • If the file size is greater than its defined HDFS block size, Hadoop will most probably divide the file into splits based on the underlying blocks (e.g. 4 blocks would result in 4 splits).
    • You can configure the two properties mapred.min.split.size and mapred.max.split.size, which guide the input format when breaking blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).
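    The way these properties combine is the well-known formula max(minSize, min(maxSize, blockSize)). A minimal standalone Java sketch of that calculation (class and method names here are illustrative, mirroring the logic rather than calling Hadoop's real API):

    ```java
    // Standalone sketch of how a FileInputFormat-style class derives the split
    // size from the configured min/max split sizes and the HDFS block size.
    // Names are hypothetical; this does not depend on the Hadoop libraries.
    public class SplitSizeCalc {

        // Mirrors the formula: max(minSize, min(maxSize, blockSize))
        public static long computeSplitSize(long minSize, long maxSize, long blockSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block

            // With the defaults (min = 1, max = unbounded), the split size
            // simply equals the block size.
            System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize));

            // Raising the minimum above the block size forces larger splits
            // that span multiple blocks.
            System.out.println(computeSplitSize(256L * 1024 * 1024, Long.MAX_VALUE, blockSize));

            // Lowering the maximum below the block size forces smaller,
            // sub-block splits (more map tasks).
            System.out.println(computeSplitSize(1L, 32L * 1024 * 1024, blockSize));
        }
    }
    ```

    So the minimum wins over the block size, and the maximum caps it; with default settings the split size is just the block size.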

    If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (the new and old APIs have the same method, but they may have some subtle differences).
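    In spirit, getSplits() walks each file and carves it into split-size chunks, with a small "slop" factor so that a tiny tail at the end of a file is folded into the last split instead of becoming its own split. A simplified standalone sketch of that loop (not Hadoop's actual API; the 1.1 slop factor matches the SPLIT_SLOP constant in FileInputFormat's source):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Simplified model of the carving loop inside FileInputFormat.getSplits():
    // carve full-sized splits off the file until what remains is no more than
    // 10% larger than one split, then emit the remainder as the final split.
    public class SplitsSketch {
        private static final double SPLIT_SLOP = 1.1; // slop factor from Hadoop's source

        /** Returns the length of each split carved from a file of the given size. */
        public static List<Long> splitLengths(long fileLength, long splitSize) {
            List<Long> lengths = new ArrayList<>();
            long remaining = fileLength;
            while ((double) remaining / splitSize > SPLIT_SLOP) {
                lengths.add(splitSize);
                remaining -= splitSize;
            }
            if (remaining > 0) {
                lengths.add(remaining); // the tail (possibly slightly oversized) split
            }
            return lengths;
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024;
            // A 500 MB file with 128 MB splits: three full splits plus a 116 MB tail.
            System.out.println(splitLengths(500 * mb, 128 * mb));
            // A 130 MB file: only ~1.6% over one split, so the slop rule keeps
            // it as a single 130 MB split rather than a 128 MB + 2 MB pair.
            System.out.println(splitLengths(130 * mb, 128 * mb));
        }
    }
    ```

    The real method also consults the block locations of each chunk so splits can be scheduled near their data, but the carving arithmetic is essentially the above.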
