Does Hadoop split the data based on the number of mappers set in the program? That is, with a data set of 500 MB, if the number of mappers is 200 (assuming that the Ha
When you load data into the Hadoop Distributed File System (HDFS), Hadoop splits your data according to the block size (64 MB by default) and distributes the blocks across the cluster. So your 500 MB will be split into 8 blocks (7 full 64 MB blocks plus one 52 MB block). This does not depend on the number of mappers; it is a property of HDFS.
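You can see this block layout directly with the HDFS client API. Below is a minimal sketch (the file path and cluster settings are hypothetical) that lists the blocks backing a file, illustrating that the splitting is done by HDFS itself, independent of any MapReduce job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical 500 MB input file already stored in HDFS
            Path file = new Path("/data/input-500mb.dat");
            FileStatus status = fs.getFileStatus(file);

            // With the default 64 MB block size, a 500 MB file is stored as
            // 7 full blocks + 1 partial block = 8 blocks.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }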
Now, when you run a MapReduce job, Hadoop by default assigns one map task per input split, and by default one split corresponds to one block. So if you have 8 blocks, Hadoop will run 8 map tasks.
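The sketch below shows the split-size rule used by the new-API FileInputFormat, splitSize = max(minSize, min(maxSize, blockSize)). With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) the split size equals the block size, which is why a 500 MB file in 64 MB blocks yields 8 splits and hence 8 map tasks:

    public class SplitSizeDemo {
        // Same rule FileInputFormat applies when computing split size
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // default HDFS block size
            long fileSize  = 500L * 1024 * 1024;  // the 500 MB data set
            long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

            // Ceiling division: number of splits = number of map tasks
            long numSplits = (fileSize + splitSize - 1) / splitSize;
            System.out.println("splits (and map tasks): " + numSplits); // prints 8
        }
    }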
However, if you specify the number of mappers explicitly (e.g., 200), then the amount of data processed by each mapper depends on the distribution of the blocks and on which node your mapper is running. How many mappers actually process your data is determined by your input splits.
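For illustration, here is a minimal driver sketch using the old mapred API (the job name and paths are hypothetical). Requesting 200 map tasks this way is only a hint to the framework; the actual number of map tasks is still governed by how the input is split:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapperHintDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapperHintDriver.class);
            conf.setJobName("mapper-hint-demo");

            // Hypothetical input/output paths
            FileInputFormat.setInputPaths(conf, new Path("/data/input-500mb.dat"));
            FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

            // Requests 200 map tasks, but this is only a hint: the number of
            // map tasks actually launched follows the number of input splits.
            conf.setNumMapTasks(200);

            JobClient.runJob(conf);
        }
    }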
In your case, with 500 MB split into 8 blocks, even if you request 200 mappers, not all of them will process data, even if they are initialized.