How the data is split in Hadoop

不知归路 2021-01-31 12:17

Does Hadoop split the data based on the number of mappers set in the program? That is, having a data set of size 500MB, if the number of mappers is 200 (assuming that the Hadoop cluster allows 200 mappers simultaneously), is each mapper given 2.5 MB of data? Also, do all the mappers run simultaneously, or might some of them get run in serial?

5 Answers
  •  别跟我提以往
    2021-01-31 12:36

    No, it does not.

    The number of mappers for a job is determined by the framework.

    Have a look at the Apache MapReduce tutorial:

    How Many Maps?

    The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

    The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

    Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
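    To make that concrete, here is a minimal sketch using the Hadoop MapReduce Java API. The input path and the sizes are illustrative, but FileInputFormat.setMinInputSplitSize/setMaxInputSplitSize and MRJobConfig.NUM_MAPS are the standard knobs the tutorial refers to (job setup such as mapper class, output path, and submission is omitted here):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.MRJobConfig;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        public class SplitSizeDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "split-size-demo");

                // Illustrative input path.
                FileInputFormat.addInputPath(job, new Path("/data/input"));

                // Raising the minimum split size reduces the number of mappers;
                // lowering the maximum split size increases it.
                FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // 128 MB
                FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB

                // Only a hint: the framework ignores it if the split
                // computation yields a higher count.
                job.getConfiguration().setInt(MRJobConfig.NUM_MAPS, 10);
            }
        }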

    Coming back to your queries:

    That is, having a data set of size 500MB, if the number of mappers is 200 (assuming that the Hadoop cluster allows 200 mappers simultaneously), is each mapper given 2.5 MB of data?

    If the DFS block size and input split size are both 128 MB, then a 500 MB file is divided into 4 input splits (128 + 128 + 128 + 116 MB), so the framework will run 4 mapper tasks in this case.
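    As a rough illustration of where that count comes from, the following hypothetical helper mimics the split loop in FileInputFormat, including the 10% "slop" tolerance it allows on the final split. This is a sketch of the logic, not the library code itself:

        public class SplitCount {
            static final double SPLIT_SLOP = 1.1; // same tolerance constant FileInputFormat uses

            static int countSplits(long fileSize, long splitSize) {
                int splits = 0;
                long remaining = fileSize;
                // Carve off full-size splits while the remainder is more than
                // 10% larger than one split; the tail becomes the last split.
                while ((double) remaining / splitSize > SPLIT_SLOP) {
                    splits++;
                    remaining -= splitSize;
                }
                if (remaining > 0) splits++;
                return splits;
            }

            public static void main(String[] args) {
                long mb = 1024L * 1024;
                // 500 MB file with 128 MB splits -> 4 mappers (128+128+128+116 MB)
                System.out.println(countSplits(500 * mb, 128 * mb)); // prints 4
            }
        }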

    Do all the mappers run simultaneously or some of them might get run in serial?

    All mappers run in parallel, subject to available cluster capacity. Reducers, however, run their reduce phase only after the output from all mappers has been copied and is available to them.
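    One related, standard configuration knob: reducers may start fetching map output before every mapper finishes, controlled by the Hadoop property mapreduce.job.reduce.slowstart.completedmaps (the value below is just an example; the default is 0.05). The reduce() calls themselves still wait until all map output has arrived:

        import org.apache.hadoop.conf.Configuration;

        public class SlowstartDemo {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Launch reducers (to begin copying map output) once
                // 80% of the mappers have completed.
                conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
            }
        }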
