Hadoop chunk size vs split vs block size

Submitted by 本小妞迷上赌 on 2020-01-02 07:03:43

Question


I am a little bit confused about some Hadoop concepts.

What is the difference between Hadoop chunk size, split size, and block size?

Thanks in advance.


Answer 1:


Block size and chunk size are the same thing. Split size may differ from the block/chunk size.

The MapReduce algorithm does not work on the physical blocks of a file; it works on logical input splits. Where a split begins and ends depends on where records were written: a single record may start in one block and end in the next, so its bytes can straddle two blocks.

The way HDFS is set up, it breaks very large files into large blocks (for example, 128 MB, the default in recent Hadoop versions) and, with the default replication factor, stores three copies of each block on different nodes in the cluster. HDFS has no awareness of the content of these files, so block boundaries fall at arbitrary byte offsets, not at record boundaries.
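To make the block arithmetic concrete, here is a minimal Python sketch. The 128 MB block size and the replication factor of 3 are only the common defaults (configurable via `dfs.blocksize` and `dfs.replication`); the helper function is for illustration, not part of any Hadoop API:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # common dfs.blocksize default, in bytes

def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size / block_size)

# A 300 MB file occupies 3 blocks: 128 MB + 128 MB + 44 MB
print(num_blocks(300 * 1024 * 1024))  # 3
```

Note that the last block only takes up as much space as the remaining data; a 44 MB tail does not consume a full 128 MB on disk.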

To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends.
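The split size itself is derived from the block size and two job settings (`mapreduce.input.fileinputformat.split.minsize` and `.maxsize`); in Hadoop's `FileInputFormat` the calculation is essentially `max(minSize, min(maxSize, blockSize))`. A sketch of that formula in Python:

```python
def compute_split_size(block_size: int,
                       min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    """Mirrors FileInputFormat.computeSplitSize:
    max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

block = 128 * 1024 * 1024
print(compute_split_size(block))                          # default: split == block
print(compute_split_size(block, max_size=64 * 1024**2))   # capped at 64 MB
```

With the default settings the split size equals the block size, which is why the two are often conflated; only when the min/max settings are overridden do they diverge.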

In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
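As an illustration only (this is not Hadoop's actual code), the sketch below shows the two rules a line-oriented record reader such as `LineRecordReader` follows: if the split does not start at offset 0, skip the partial first record (it belongs to the previous split), and always read past the split's end to finish the last record that starts inside it. Together these rules guarantee every record is read exactly once even when records straddle split boundaries:

```python
def read_split(data: bytes, start: int, length: int) -> list:
    """Return the newline-delimited records owned by the split
    [start, start + length): records that *begin* inside the split."""
    pos = start
    if start > 0:
        # Skip the partial first record; the previous split reads it.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    end = start + length
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])   # last record, no trailing newline
            pos = len(data)
        else:
            records.append(data[pos:nl])  # may read past `end` to finish it
            pos = nl + 1
    return records

data = b"alpha\nbravo\ncharlie\n"
# Two 10-byte "splits" still yield every record exactly once:
print(read_split(data, 0, 10))   # [b'alpha', b'bravo']
print(read_split(data, 10, 10))  # [b'charlie']
```

In the example, `bravo` starts at byte 6 (inside the first split) but ends at byte 11 (inside the second), so the first reader finishes it and the second reader skips it.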

Have a look at this article for more details.

Related SE questions:

About Hadoop/HDFS file splitting

Split size vs Block size in Hadoop



Source: https://stackoverflow.com/questions/34704312/hadoop-chunk-size-vs-split-vs-block-size
