Question
We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.
Answer 1:
You could look through the source for FileInputFormat.getSplits(). It reads the configuration property mapred.input.dir and resolves this comma-separated value to an array of Paths.
These paths can still represent folders and glob patterns, so the next thing getSplits() does is pass the array to the protected method org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). That method goes through the listed directories and patterns and lists the matching files (also invoking a PathFilter if one is configured).
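To make the two steps above concrete, here is a minimal plain-Java sketch of the same idea: split a comma-separated list of locations, then expand each directory into its files and count them. This is not Hadoop code. The class InputPathCounter and its methods are hypothetical names for illustration, it works on the local filesystem via java.nio.file, it does not expand glob patterns, and the filter that skips names starting with "_" or "." is an assumption standing in for whatever PathFilter applies to your job.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical helper, not part of Hadoop.
class InputPathCounter {

    // Split a comma-separated list of input locations into individual paths.
    static List<Path> parseInputDirs(String csv) {
        List<Path> paths = new ArrayList<>();
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(Paths.get(trimmed));
            }
        }
        return paths;
    }

    // Assumed filter: skip names that start with "_" or ".".
    static boolean accept(Path file) {
        String name = file.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // A directory contributes its accepted direct children;
    // a regular file contributes itself.
    static List<Path> listFiles(List<Path> inputs) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                try (DirectoryStream<Path> children = Files.newDirectoryStream(input)) {
                    for (Path child : children) {
                        if (Files.isRegularFile(child) && accept(child)) {
                            files.add(child);
                        }
                    }
                }
            } else if (Files.isRegularFile(input) && accept(input)) {
                files.add(input);
            }
        }
        return files;
    }
}
```

The size of the list returned by listFiles is the kind of "total input file count" the question asks about. In a real job the same count would come from the FileStatus list that listStatus returns.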
Since this method is protected, you could create a simple 'dummy' extension of FileInputFormat with a public listStatus method that accepts the Mapper.Context as its argument and wraps a call to FileInputFormat.listStatus:
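import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DummyFileInputFormat extends FileInputFormat<Object, Object> {
    // Widens the protected listStatus to public; a Mapper.Context is a JobContext
    @Override
    public List<FileStatus> listStatus(JobContext mapContext) throws IOException {
        return super.listStatus(mapContext);
    }

    @Override
    public RecordReader<Object, Object> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // dummy input format, so this will never be called
        return null;
    }
}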
EDIT: In fact, it looks like FileInputFormat already does this for you: it sets the job property mapreduce.input.num.files at the end of the getSplits() method (at least in 1.0.2, probably introduced in 0.20.203).
Here's the JIRA ticket
Answer 2:
You can set a configuration property in your job that holds the number of input paths, like this:
jobConf.setInt("numberOfPaths",paths.length);
Put that line wherever you configure your job. Then read the value back in your Mapper.setup(Mapper.Context context) by getting it from the context's configuration.
Source: https://stackoverflow.com/questions/10585560/get-total-input-path-count-in-hadoop-mapper