Question
I am trying to send two files to a Hadoop reducer. I tried DistributedCache, but anything I put with addCacheFile in main doesn't seem to be available via getLocalCacheFiles in the mapper.
Right now I am using FileSystem to read the file, but since I am running locally I can just pass the file name directly. I am wondering how to do this if I were running on a real Hadoop cluster.
Is there any way to send values to the mapper other than the file it's reading?
Answer 1:
I also had a lot of problems with the distributed cache and with sending parameters. The options that worked for me are below:
For distributed cache usage: for me it was a nightmare to get the URL/path to a file on HDFS inside Map or Reduce, but with a symlink it worked. In the run() method of the job:
DistributedCache.addCacheFile(new URI(file+"#rules.dat"), conf);
DistributedCache.createSymlink(conf);
then declare a field in the Mapper or Reducer class, before the methods:
public static FSDataInputStream hdfs; // open() returns a stream, so the field is FSDataInputStream, not FileSystem
and then in the setup() method of Map or Reduce:
hdfs = FileSystem.get(new Configuration()).open(new Path("rules.dat"));
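To put the pieces together, here is a minimal sketch of a Mapper that reads the symlinked cache file in setup(). The file name rules.dat comes from the answer above; the line-by-line parsing and class name are assumptions for illustration, not part of the original answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RulesMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<String> rules = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Thanks to DistributedCache.createSymlink(), "rules.dat" appears in the
        // task's working directory, so it can be opened like an ordinary local file.
        BufferedReader reader = new BufferedReader(new FileReader("rules.dat"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                rules.add(line); // hypothetical: keep each rule line in memory
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use the cached rules while processing each input record ...
    }
}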
For parameters: to send some values to Map or Reduce (this could be a filename to open from HDFS), in the run() method:
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
...
conf.set("level", otherArgs[2]); //sets variable level from command line, it could be a filename
...
}
then in the Mapper or Reducer class, read it back from the job configuration (available via context.getConfiguration()):
int level = Integer.parseInt(context.getConfiguration().get("level")); //this is an int, but you can also read strings, etc.
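As a concrete sketch of the reading side, here is a Mapper that pulls the parameter in setup(). The property name "level" comes from the answer; the class name and the use of getInt with a default are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LevelMapper extends Mapper<LongWritable, Text, Text, Text> {

    private int level;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the value that the driver placed in the job configuration with
        // conf.set("level", ...). The default of 0 avoids a NumberFormatException
        // if the property was never set (an assumption, not from the answer).
        level = context.getConfiguration().getInt("level", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use 'level' while processing each record ...
    }
}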
Answer 2:
If the distributed cache suits your needs, it is the way to go.
getLocalCacheFiles works differently in local mode and in distributed mode (it actually does not work in local mode).
Look at this link: http://developer.yahoo.com/hadoop/tutorial/module5.html and search for the phrase "As a cautionary note".
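One way to cope with that local-mode difference is to guard against a null result from getLocalCacheFiles. A minimal sketch, assuming a fallback path is stashed in the configuration under a hypothetical "rules.path" property:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) {
            // Local mode: getLocalCacheFiles returns null, so fall back to the
            // original path stored in the configuration (hypothetical property name).
            String fallback = context.getConfiguration().get("rules.path");
            cached = new Path[] { new Path(fallback) };
        }
        // ... open cached[0] with FileSystem as usual ...
    }
}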
Source: https://stackoverflow.com/questions/9148724/multiple-input-into-a-mapper-in-hadoop