Hadoop chunk size vs split vs block size

Submitted by 本小妞迷上赌 on 2020-01-02 07:03:43

Question


I am a little bit confused about some Hadoop concepts.

What is the difference between Hadoop chunk size, split size, and block size?

Thanks in advance.


Answer 1:


Block size and chunk size are the same thing. Split size may differ from the block/chunk size.

The MapReduce algorithm does not work on the physical blocks of a file; it works on logical input splits. Where a split begins and ends depends on where records were written: a single record may start in one block and end in the next, so its bytes can straddle two blocks.

The way HDFS is set up, it breaks very large files into large blocks (for example, 128 MB, the default in recent Hadoop versions) and, with the default replication factor, stores three copies of each block on different nodes in the cluster. HDFS has no awareness of the content of these files, so block boundaries fall at arbitrary byte offsets, not at record boundaries.
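To make the block arithmetic concrete, here is a minimal Python sketch. The 128 MB block size and the replication factor of 3 are only the common defaults (configurable via `dfs.blocksize` and `dfs.replication`); the helper function is for illustration, not part of any Hadoop API:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # common dfs.blocksize default, in bytes

def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size / block_size)

# A 300 MB file occupies 3 blocks: 128 MB + 128 MB + 44 MB
print(num_blocks(300 * 1024 * 1024))  # 3
```

Note that the last block only takes up as much space as the remaining data; a 44 MB tail does not consume a full 128 MB on disk.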

To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends.
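The split size itself is derived from the block size and two job settings (`mapreduce.input.fileinputformat.split.minsize` and `.maxsize`); in Hadoop's `FileInputFormat` the calculation is essentially `max(minSize, min(maxSize, blockSize))`. A sketch of that formula in Python:

```python
def compute_split_size(block_size: int,
                       min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    """Mirrors FileInputFormat.computeSplitSize:
    max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

block = 128 * 1024 * 1024
print(compute_split_size(block))                          # default: split == block
print(compute_split_size(block, max_size=64 * 1024**2))   # capped at 64 MB
```

With the default settings the split size equals the block size, which is why the two are often conflated; only when the min/max settings are overridden do they diverge.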

In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
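As an illustration only (this is not Hadoop's actual code), the sketch below shows the two rules a line-oriented record reader such as `LineRecordReader` follows: if the split does not start at offset 0, skip the partial first record (it belongs to the previous split), and always read past the split's end to finish the last record that starts inside it. Together these rules guarantee every record is read exactly once even when records straddle split boundaries:

```python
def read_split(data: bytes, start: int, length: int) -> list:
    """Return the newline-delimited records owned by the split
    [start, start + length): records that *begin* inside the split."""
    pos = start
    if start > 0:
        # Skip the partial first record; the previous split reads it.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    end = start + length
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])   # last record, no trailing newline
            pos = len(data)
        else:
            records.append(data[pos:nl])  # may read past `end` to finish it
            pos = nl + 1
    return records

data = b"alpha\nbravo\ncharlie\n"
# Two 10-byte "splits" still yield every record exactly once:
print(read_split(data, 0, 10))   # [b'alpha', b'bravo']
print(read_split(data, 10, 10))  # [b'charlie']
```

In the example, `bravo` starts at byte 6 (inside the first split) but ends at byte 11 (inside the second), so the first reader finishes it and the second reader skips it.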

Have a look at this article for more details.

Related SE questions:

About Hadoop/HDFS file splitting

Split size vs Block size in Hadoop



Source: https://stackoverflow.com/questions/34704312/hadoop-chunk-size-vs-split-vs-block-size
