Hadoop DistributedCache is deprecated - what is the preferred API?

情深已故 2020-11-28 04:14

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.

The Hadoop MapReduce Tutorial shows the usage of the Distribut

6 answers
  •  夕颜
    夕颜 (original poster)
    2020-11-28 05:16

    I did not use job.addCacheFile(). Instead, I kept using the -files option as before, for example "-files /path/to/myfile.txt#myfile". Then, in the mapper or reducer code, I use the method below:

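    // Imports needed by the class that holds this helper (HadoopUtils in the usage below)
    import java.io.File;
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapreduce.JobContext;

    /**
     * Locates a file shipped through the distributed cache (for example with -files).
     * This method can be used with local execution or HDFS execution.
     *
     * @param context job or task context whose cache files are searched
     * @param symLink symlink name, i.e. the URI fragment after '#'
     * @param throwExceptionIfNotFound if true, throw instead of returning null when nothing matches
     * @return the matching file, or null if nothing matches and throwExceptionIfNotFound is false
     * @throws IOException if the cache files or the file system cannot be accessed
     */
    public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
    {
        URI[] uris = context.getCacheFiles();
        URI symlinkUri = null;
        if(uris!=null)
        {
            // the symlink name is carried in the fragment of each cache URI
            for(URI uri: uris)
            {
                if(symLink.equals(uri.getFragment()))
                {
                    symlinkUri = uri;
                    break;
                }
            }
        }
        if(symlinkUri==null)
        {
            if(throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
            return null;
        }
        //if we run this locally the file system URI scheme will be "file" otherwise it should be a symlink
        //getUri().getScheme() is used because not every FileSystem implementation overrides getScheme()
        String scheme = FileSystem.get(context.getConfiguration()).getUri().getScheme();
        return "file".equalsIgnoreCase(scheme) ? new File(symlinkUri.getPath()) : new File(symLink);
    }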
    

    Then in mapper/reducer:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        super.setup(context);
    
        File file = HadoopUtils.findDistributedFileBySymlink(context,"myfile",true);
        ... do work ...
    }
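The "... do work ..." step depends on what the cached file contains, which the post does not say. As one possible illustration, assuming the file holds key=value configuration lines, the sketch below loads it with java.util.Properties. The class name CachedConfigLoader and the "threshold" key are hypothetical and only serve the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Properties;

// Hypothetical helper, not part of Hadoop: parses a localized cache file
// under the assumption that it contains key=value lines.
class CachedConfigLoader {

    /** Reads all key=value pairs from the given file as UTF-8 text. */
    public static Properties load(File file) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the File returned by findDistributedFileBySymlink(...)
        File tmp = File.createTempFile("myfile", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "threshold=5\n".getBytes(StandardCharsets.UTF_8));

        System.out.println(load(tmp).getProperty("threshold")); // prints 5
    }
}
```

In setup(), the File returned by the helper above could be passed to load(...) once, and the resulting Properties kept in a field for use in map() or reduce().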
    

    Note that if I pass "-files /path/to/myfile.txt" directly, without the "#myfile" fragment, I need to use "myfile.txt" to access the file, since that is the default symlink name.
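The naming rule described in this answer (the "#fragment" is the symlink name, and without a fragment the file name is used) can be sketched in plain Java. This is only an illustration of the rule as stated here, not Hadoop's own code; the class CacheLinkName and the method linkName are hypothetical names.

```java
import java.net.URI;

// Hypothetical illustration of the symlink naming rule described above.
class CacheLinkName {

    /** Returns the URI fragment if one is present, otherwise the last path segment. */
    public static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        // lastIndexOf returns -1 when there is no '/', so the whole path is kept
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(linkName(URI.create("/path/to/myfile.txt#myfile"))); // prints myfile
        System.out.println(linkName(URI.create("/path/to/myfile.txt")));        // prints myfile.txt
    }
}
```

The returned name is the value to pass as the symLink argument of findDistributedFileBySymlink.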
