My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.
The Hadoop MapReduce Tutorial shows the usage of the Distribut
I did not use job.addCacheFile(). Instead, I kept using the -files option as before, e.g. "-files /path/to/myfile.txt#myfile". Then, in the mapper or reducer code, I use the method below:
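import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.JobContext;

public class HadoopUtils
{
    /**
     * Finds a file shipped through the distributed cache by its symlink name.
     * This method can be used with local execution or HDFS execution.
     *
     * @param context the job context (a mapper or reducer Context works)
     * @param symLink the symlink name, i.e. the part after '#' in the -files argument
     * @param throwExceptionIfNotFound if true, throw instead of returning null when nothing matches
     * @return the matching file, or null if none was found and throwExceptionIfNotFound is false
     * @throws IOException if the file system cannot be obtained
     */
    public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
    {
        URI[] uris = context.getCacheFiles();
        if (uris == null || uris.length == 0)
        {
            if (throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '" + symLink + "' in distributed cache");
            return null;
        }
        // Look for the cache entry whose URI fragment equals the requested symlink name.
        URI symlinkUri = null;
        for (URI uri : uris)
        {
            if (symLink.equals(uri.getFragment()))
            {
                symlinkUri = uri;
                break;
            }
        }
        if (symlinkUri == null)
        {
            if (throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '" + symLink + "' in distributed cache");
            return null;
        }
        // If we run this locally, the file system URI scheme will be "file", so open the original path;
        // otherwise the file should be reachable through the symlink in the working directory.
        return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())
                ? new File(symlinkUri.getPath())
                : new File(symLink);
    }
}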
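The matching loop above depends only on java.net.URI: the text after '#' in a -files argument is the URI fragment. The following is a minimal sketch of that comparison using only the JDK, so it can be tried without a cluster. The class name SymlinkMatchDemo, the host "namenode" and the example paths are hypothetical and not part of Hadoop.

```java
import java.net.URI;

public class SymlinkMatchDemo {
    /** Returns the first URI whose fragment equals symLink, or null if none matches. */
    public static URI findBySymlink(URI[] uris, String symLink) {
        if (uris == null) {
            return null;
        }
        for (URI uri : uris) {
            // getFragment() is null when the URI has no '#...' part,
            // so such entries never match.
            if (symLink.equals(uri.getFragment())) {
                return uri;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical cache entries, standing in for context.getCacheFiles().
        URI[] uris = {
            URI.create("hdfs://namenode/tmp/other.txt"),
            URI.create("hdfs://namenode/path/to/myfile.txt#myfile")
        };
        URI match = findBySymlink(uris, "myfile");
        System.out.println(match.getPath());      // /path/to/myfile.txt
        System.out.println(match.getFragment());  // myfile
        System.out.println(findBySymlink(uris, "myfile.txt")); // null
    }
}
```

The last call returns null because the lookup compares against the fragment only, not against the file name in the path.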
Then, in the mapper or reducer:
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);
    File file = HadoopUtils.findDistributedFileBySymlink(context, "myfile", true);
    ... do work ...
}
Note that if I use "-files /path/to/myfile.txt" directly, without a "#name" suffix, then I need to pass "myfile.txt" to access the file, since that is the default symlink name.
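The naming rule in that note can be written down as a small helper: use the part after '#' if there is one, otherwise fall back to the file name. This is only a sketch of the rule as described here, using plain JDK classes. LinkNameDemo and linkName are hypothetical names, and it is an assumption that Hadoop derives the default name in exactly this way.

```java
import java.net.URI;

public class LinkNameDemo {
    /** Returns the explicit '#name' fragment if present, otherwise the last path segment. */
    public static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // No explicit name given: fall back to the file name (assumed default).
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(linkName(URI.create("/path/to/myfile.txt#myfile"))); // myfile
        System.out.println(linkName(URI.create("/path/to/myfile.txt")));        // myfile.txt
    }
}
```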