Question
I have a pretty basic question that I am trying to find an answer for. I was looking through the documentation to understand where the data is spilled to during the map, shuffle, and reduce phases. For example, if Mapper A has 16 GB of RAM and the memory allocated to the mapper is exceeded, the data is spilled.
Is the data spilled to HDFS, or is it spilled to a tmp folder on local disk? During the shuffle phase, when data is streamed from one node to another, is it stored in HDFS or in a temporary storage location?
The reason I ask is to figure out whether a clean-up process is needed after the job is done. Please help.
Answer 1:
A mapper's intermediate files (spill files) are stored in the local filesystem of the worker node where the mapper is running, not in HDFS. Likewise, data streamed from one node to another during the shuffle is written to the local filesystem of the worker node where the receiving task runs.
This local filesystem path is derived from the hadoop.tmp.dir
property, which by default is /tmp/hadoop-${user.name}.
After the job completes (or fails), the temporary data on the local filesystem is cleared automatically. You don't have to perform any clean-up; it's handled by the framework.
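If you want the spill and shuffle data to land on a specific disk rather than the default, you can override the property in core-site.xml. This is only an illustrative sketch; the path /data/hadoop/tmp is a hypothetical example, not a recommended value:

```xml
<!-- core-site.xml: example override of the local temp directory.
     /data/hadoop/tmp is a placeholder path; choose a disk with
     enough free space for intermediate map output. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```

You can confirm the value the cluster actually resolves with `hdfs getconf -confKey hadoop.tmp.dir`.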
来源:https://stackoverflow.com/questions/29262194/disk-spill-during-mapreduce