Life of distributed cache in Hadoop

Submitted by 心不动则不痛 on 2019-12-07 04:42:25

Question


When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job completes? If they are deleted, which I presume they are, is there a way to make the cache persist across multiple jobs? Does this work the same way on Amazon's Elastic MapReduce?
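For context, this is roughly how a side file gets shipped to every task node via the distributed cache in a streaming job. A minimal sketch, in which the paths, the #lookup link name, and the streaming-jar location are hypothetical:

    # Ship lookup.txt to each task's working directory, symlinked as "lookup".
    # -files is a generic option and must precede the streaming-specific options.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -files hdfs:///user/me/lookup.txt#lookup \
      -input  /data/input \
      -output /data/output \
      -mapper 'python mapper.py' \
      -reducer cat

It is the framework-managed copies created by this mechanism whose lifetime is in question below.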


Answer 1:


I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager roughly once a minute, once their reference count drops to zero. The TaskRunner explicitly releases all of its files at the end of a task. Perhaps you could patch TaskRunner not to do this, and control the cache through more explicit means yourself?




Answer 2:


I cross-posted this question to the AWS forum and got a good recommendation: use hadoop fs -get to transfer files in a way that persists across jobs.
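A minimal sketch of that approach, with hypothetical paths; each job fetches the file itself during setup instead of relying on the distributed cache:

    # Copy a shared file from HDFS to the node's local filesystem.
    # Unlike distributed-cache entries, this copy is not reference-counted
    # or deleted by the framework when the job finishes.
    hadoop fs -get hdfs:///user/me/lookup.txt /tmp/lookup.txt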



Source: https://stackoverflow.com/questions/4483733/life-of-distributed-cache-in-hadoop
