Does Hadoop split the data based on the number of mappers set in the program? That is, with a data set of 500 MB, if the number of mappers is 200 (assuming that the Ha
When you load data into the Hadoop Distributed File System (HDFS), Hadoop splits your data according to the block size (64 MB by default) and distributes the blocks across the cluster. So your 500 MB will be split into 8 blocks (7 full 64 MB blocks plus one 52 MB block). This does not depend on the number of mappers; it is a property of HDFS.
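You can see this block layout directly with the HDFS client API. Below is a minimal sketch (the file path and cluster settings are hypothetical) that lists the blocks backing a file, illustrating that the splitting is done by HDFS itself, independent of any MapReduce job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical 500 MB input file already stored in HDFS
            Path file = new Path("/data/input-500mb.dat");
            FileStatus status = fs.getFileStatus(file);

            // With the default 64 MB block size, a 500 MB file is stored as
            // 7 full blocks + 1 partial block = 8 blocks.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }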
Now, when you run a MapReduce job, Hadoop by default assigns one map task per input split, and by default one split corresponds to one block. So if you have 8 blocks, Hadoop will run 8 map tasks.
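The sketch below shows the split-size rule used by the new-API FileInputFormat, splitSize = max(minSize, min(maxSize, blockSize)). With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) the split size equals the block size, which is why a 500 MB file in 64 MB blocks yields 8 splits and hence 8 map tasks:

    public class SplitSizeDemo {
        // Same rule FileInputFormat applies when computing split size
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // default HDFS block size
            long fileSize  = 500L * 1024 * 1024;  // the 500 MB data set
            long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

            // Ceiling division: number of splits = number of map tasks
            long numSplits = (fileSize + splitSize - 1) / splitSize;
            System.out.println("splits (and map tasks): " + numSplits); // prints 8
        }
    }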
However, if you specify the number of mappers explicitly (e.g., 200), then the amount of data processed by each mapper depends on the distribution of the blocks and on which node your mapper is running. How many mappers actually process your data is determined by your input splits.
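For illustration, here is a minimal driver sketch using the old mapred API (the job name and paths are hypothetical). Requesting 200 map tasks this way is only a hint to the framework; the actual number of map tasks is still governed by how the input is split:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapperHintDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapperHintDriver.class);
            conf.setJobName("mapper-hint-demo");

            // Hypothetical input/output paths
            FileInputFormat.setInputPaths(conf, new Path("/data/input-500mb.dat"));
            FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

            // Requests 200 map tasks, but this is only a hint: the number of
            // map tasks actually launched follows the number of input splits.
            conf.setNumMapTasks(200);

            JobClient.runJob(conf);
        }
    }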
In your case, with 500 MB split into 8 blocks, even if you request 200 mappers, not all of them will process data, even if they are initialized.