Question
I have a pretty basic question that I am trying to find an answer for. I was looking through the documentation to understand where the data is spilled to during the map, shuffle, and reduce phases. For example, if Mapper A has 16 GB of RAM and the memory allocated to the mapper is exceeded, the data is spilled.
Is the data spilled to HDFS, or is it spilled to a tmp folder on local disk? During the shuffle phase, when data is streamed from one node to another, is it stored in HDFS or in a temporary storage location?
The reason I ask is to figure out whether a clean-up process is needed after the job is done. Please help.
Answer 1:
A mapper's intermediate files (spill files) are stored in the local filesystem of the worker node where the mapper is running, not in HDFS. Likewise, data streamed from one node to another during the shuffle is written to the local filesystem of the worker node where the receiving task runs.
This local filesystem path is derived from the hadoop.tmp.dir
property, which by default is /tmp/hadoop-${user.name}.
After the job completes (or fails), the temporary data on the local filesystem is cleared automatically. You don't have to perform any clean-up; it's handled by the framework.
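If you want the spill and shuffle data to land on a specific disk rather than the default, you can override the property in core-site.xml. This is only an illustrative sketch; the path /data/hadoop/tmp is a hypothetical example, not a recommended value:

```xml
<!-- core-site.xml: example override of the local temp directory.
     /data/hadoop/tmp is a placeholder path; choose a disk with
     enough free space for intermediate map output. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```

You can confirm the value the cluster actually resolves with `hdfs getconf -confKey hadoop.tmp.dir`.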
来源:https://stackoverflow.com/questions/29262194/disk-spill-during-mapreduce