Hadoop Reducer: How can I output to multiple directories using speculative execution?

Submitted by 一世执手 on 2019-12-02 04:51:55

The way Hadoop typically deals with speculative execution is to create an output folder for each task attempt (in a _temporary subfolder of the actual HDFS output directory).

The OutputCommitter for the OutputFormat then simply moves the contents of the temp task folder to the actual output folder when a task succeeds, and deletes the temp task folders of any failed or aborted attempts (this is the default behavior for most FileOutputFormats).

So in your case, if you are writing to a folder outside the job output folder, you'll need to extend or implement your own OutputCommitter. I'd follow the same principles when creating the files: include the full task ID (including the attempt ID) in each file name to avoid name collisions under speculative execution. How you track the files created in your job and manage their deletion in the abort / fail scenarios is up to you (maybe some file globbing on the task IDs?).
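
To make the naming and cleanup idea concrete, here is a minimal sketch that uses only the JDK and a local directory in place of HDFS. The class name SideOutputSketch, the "<base>.<attemptId>.tmp" naming scheme, and the helper methods are all hypothetical choices for illustration, not part of Hadoop. In a real job I'd expect the commit and abort logic to live in the commitTask and abortTask methods of your OutputCommitter, with the attempt ID taken from the TaskAttemptContext, but treat that wiring as an assumption to check against your Hadoop version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SideOutputSketch {

    // Hypothetical naming scheme: <base>.<attemptId>.tmp, so that two
    // speculative attempts of the same task never write to the same file.
    static Path attemptFile(Path dir, String base, String attemptId) {
        return dir.resolve(base + "." + attemptId + ".tmp");
    }

    // Write one side-output file on behalf of a task attempt.
    static Path write(Path dir, String base, String attemptId, String data) throws IOException {
        Files.createDirectories(dir);
        return Files.write(attemptFile(dir, base, attemptId), data.getBytes(StandardCharsets.UTF_8));
    }

    // Find every file in dir that belongs to the given attempt by globbing on its ID.
    static List<Path> filesOf(Path dir, String attemptId) throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*." + attemptId + ".tmp")) {
            for (Path p : stream) {
                found.add(p);
            }
        }
        return found;
    }

    // Commit: strip the attempt suffix so the file appears under its final name.
    // Files.move without options fails if the final name already exists.
    static void commit(Path dir, String attemptId) throws IOException {
        String suffix = "." + attemptId + ".tmp";
        for (Path p : filesOf(dir, attemptId)) {
            String name = p.getFileName().toString();
            String finalName = name.substring(0, name.length() - suffix.length());
            Files.move(p, dir.resolve(finalName));
        }
    }

    // Abort: delete everything the failed or killed attempt left behind.
    static void abort(Path dir, String attemptId) throws IOException {
        for (Path p : filesOf(dir, attemptId)) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("side-output");
        // Two speculative attempts of the same reduce task (IDs are made up).
        String first = "attempt_201912020451_0001_r_000003_0";
        String second = "attempt_201912020451_0001_r_000003_1";
        write(dir, "report", first, "data");
        write(dir, "report", second, "data");
        commit(dir, first);   // the attempt that succeeded
        abort(dir, second);   // the attempt that was killed
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                System.out.println(p.getFileName());
            }
        }
    }
}
```

Because the attempt ID is part of every temporary name, the cleanup step only needs the ID to find what to delete, so no separate bookkeeping of created files is required in this sketch.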
