About Hadoop/HDFS file splitting

你的背包 2020-12-07 23:41

I just want to confirm the following. Please verify whether this is correct: 1. As per my understanding, when we copy a file into HDFS, that is the point when the file (assuming its size

3 Answers
  •  旧巷少年郎
    2020-12-08 00:05

David's answer pretty much hits the nail on the head; I am just elaborating on it here.

There are two distinct concepts at work here, and each is handled by a different entity in the Hadoop framework.

    Firstly --

1) Dividing a file into blocks -- When a file is written into HDFS, HDFS divides it into blocks and takes care of their replication. This is done once (in most cases), and the blocks are then available to all MR jobs running on the cluster. The block size is a cluster-wide configuration.
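The arithmetic behind concept 1 can be sketched in a few lines. This is an illustrative standalone class (not the Hadoop API); the 128 MB figure is the default `dfs.blocksize` in Hadoop 2.x and later, and the last block of a file may be smaller than the rest:

```java
// Sketch: how HDFS carves a file into fixed-size blocks (concept 1).
// Block size comes from the dfs.blocksize setting; every block except
// possibly the last is exactly that size.
public class BlockCount {
    static long numBlocks(long fileSize, long blockSize) {
        // Ceiling division: every started block counts as a block.
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB default block size
        long fileSize  = 300L * 1024 * 1024;   // a 300 MB file
        // 300 MB over 128 MB blocks -> 3 blocks: 128 + 128 + 44 MB
        System.out.println(BlockCount.numBlocks(fileSize, blockSize)); // prints 3
    }
}
```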

    Secondly --

2) Splitting a file into input splits -- When an input path is passed to an MR job, the job uses that path, together with the configured input format, to divide the files under the input path into splits; each split is processed by one map task. Input splits are recalculated by the input format every time a job is executed.
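Concept 2 can be sketched the same way. This is a simplified, standalone model of the logic in Hadoop's `FileInputFormat`, whose split size is `max(minSplitSize, min(maxSplitSize, blockSize))`, so with the defaults you get roughly one split per block. Class and method names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of input-split calculation (concept 2), modeled on
// FileInputFormat.computeSplitSize(). Splits are logical (offset, length)
// ranges over the file; no data is moved or copied.
public class SplitCalc {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the (offset, length) of each split for one file.
    static List<long[]> getSplits(long fileSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileSize) {
            long len = Math.min(splitSize, fileSize - offset);
            splits.add(new long[] { offset, len });
            offset += len;
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        long splitSize = SplitCalc.computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // A 300 MB file yields 3 splits, mirroring its 3 blocks.
        System.out.println(SplitCalc.getSplits(300L * 1024 * 1024, splitSize).size()); // prints 3
    }
}
```

(The real `FileInputFormat` also applies a slop factor, letting the last split run slightly over the split size to avoid a tiny trailing split; that detail is omitted here.)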

Now that we have this under our belt, we can see that the isSplitable() method belongs to the second category.
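To see where isSplitable() plugs in, here is a minimal sketch (illustrative names, not the Hadoop API): when an input format reports a file as non-splittable, as TextInputFormat does for a gzip-compressed file, the whole file becomes a single split and one map task, no matter how many HDFS blocks it occupies:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the effect of isSplitable() during split calculation
// (concept 2). A non-splittable file yields exactly one split covering
// the whole file; a splittable one is cut at the split size.
public class SplittabilityDemo {
    static List<long[]> splitsFor(long fileSize, long splitSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (!splitable) {
            // One map task must read the entire file sequentially.
            splits.add(new long[] { 0, fileSize });
            return splits;
        }
        for (long off = 0; off < fileSize; off += splitSize) {
            splits.add(new long[] { off, Math.min(splitSize, fileSize - off) });
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(SplittabilityDemo.splitsFor(300 * mb, 128 * mb, true).size());  // prints 3
        System.out.println(SplittabilityDemo.splitsFor(300 * mb, 128 * mb, false).size()); // prints 1
    }
}
```

Note that the file still occupies three HDFS blocks either way; isSplitable() only changes how the MR job reads it.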

To really nail this down, have a look at the HDFS write data flow (concept 1):

[Diagram: HDFS Write Data Flow]

The second step in the diagram is probably where the split into blocks happens; note that this has nothing to do with running an MR job.

Now have a look at the execution steps of an MR job:

[Diagram: MR job execution steps]

Here the first step is the calculation of the input splits by the InputFormat configured for the job.

A lot of your confusion stems from the fact that you are conflating these two concepts; I hope this makes it a little clearer.
