How to use a MapReduce output in Distributed Cache


Question


Let's say I have a MapReduce job that creates an output file part-00000, and a second job that runs after the first one completes.

How can I use the output file of the first job in the Distributed Cache for the second job?


Answer 1:


The following steps might help you:

  • Pass the first job's output directory path to the second job's driver class.

  • Use a PathFilter to list the files that start with part-*. Refer to the code snippet below for your second job's driver class:

        FileSystem fs = FileSystem.get(conf);
        // List only the part-* files in the first job's output directory.
        FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
                new PathFilter() {
                    @Override
                    public boolean accept(Path path) {
                        return path.getName().startsWith("part-");
                    }
                });
    
  • Iterate over every part-* file and add it to the distributed cache; a sketch of reading these files back in the mapper follows these steps.

        for (int i = 0; i < fileList.length; i++) {
            // Path.toUri() already returns a java.net.URI, so there is no need
            // to wrap it in new URI(...); the old API also takes the Configuration.
            DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
        }
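
The snippets above only add the files to the cache; to actually use them, the second job's mapper has to read them back. Here is a minimal sketch of doing that in setup(), assuming the first job wrote tab-separated key/value pairs. The class name LookupMapper and the field joinData are illustrative, not part of the original answer:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

            private final Map<String, String> joinData = new HashMap<String, String>();

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // Local paths of the files that were added with addCacheFile in the driver.
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (cacheFiles == null) {
                    return;
                }
                for (Path cacheFile : cacheFiles) {
                    BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
                    try {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            // Assumes the first job's output is tab-separated key/value pairs.
                            String[] parts = line.split("\t", 2);
                            if (parts.length == 2) {
                                joinData.put(parts[0], parts[1]);
                            }
                        }
                    } finally {
                        reader.close();
                    }
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Use joinData here, e.g. to enrich each input record with
                // values produced by the first job.
            }
        }

Note that DistributedCache is deprecated in Hadoop 2.x; with the newer API you can call job.addCacheFile(uri) in the driver and read the paths back with context.getCacheFiles() in the mapper.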
    


Source: https://stackoverflow.com/questions/30224370/how-to-use-a-mapreduce-output-in-distributed-cache
