How to use a MapReduce output in Distributed Cache


Question


Let's say I have a MapReduce job that creates an output file part-00000, and a second job that runs after the first one completes.

How can I use the output file of the first job in the Distributed Cache for the second job?


Answer 1:


The following steps might help you:

  • Pass the first job's output directory path to the second job's driver class.

  • Use a PathFilter to list the files that start with part-*. Refer to the code snippet below for your second job's driver class:

        FileSystem fs = FileSystem.get(conf);
        // List only the part-* files in the first job's output directory.
        FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
                new PathFilter() {
                    @Override
                    public boolean accept(Path path) {
                        return path.getName().startsWith("part-");
                    }
                });
    
  • Iterate over every part-* file and add it to the distributed cache; a sketch of reading these files back in the mapper follows these steps.

        for (int i = 0; i < fileList.length; i++) {
            // Path.toUri() already returns a java.net.URI, so there is no need
            // to wrap it in new URI(...); the old API also takes the Configuration.
            DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
        }
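
The snippets above only add the files to the cache; to actually use them, the second job's mapper has to read them back. Here is a minimal sketch of doing that in setup(), assuming the first job wrote tab-separated key/value pairs. The class name LookupMapper and the field joinData are illustrative, not part of the original answer:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

            private final Map<String, String> joinData = new HashMap<String, String>();

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // Local paths of the files that were added with addCacheFile in the driver.
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (cacheFiles == null) {
                    return;
                }
                for (Path cacheFile : cacheFiles) {
                    BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
                    try {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            // Assumes the first job's output is tab-separated key/value pairs.
                            String[] parts = line.split("\t", 2);
                            if (parts.length == 2) {
                                joinData.put(parts[0], parts[1]);
                            }
                        }
                    } finally {
                        reader.close();
                    }
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Use joinData here, e.g. to enrich each input record with
                // values produced by the first job.
            }
        }

Note that DistributedCache is deprecated in Hadoop 2.x; with the newer API you can call job.addCacheFile(uri) in the driver and read the paths back with context.getCacheFiles() in the mapper.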
    


Source: https://stackoverflow.com/questions/30224370/how-to-use-a-mapreduce-output-in-distributed-cache
