Get Total Input Path Count in Hadoop Mapper

ⅰ亾dé卋堺 submitted on 2019-12-11 18:00:19

Question


We are trying to get, from within our mapper, the total number of input paths our MapReduce program iterates over. We plan to use this together with a counter to format each value depending on its index. Is there an easy way to get the total input path count from the mapper? Thanks in advance.


Answer 1:


You could look through the source for FileInputFormat.getSplits(): it reads the mapred.input.dir configuration property and then resolves this comma-separated list into an array of Paths.
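If the number of paths as configured on the job is all that is needed, a mapper may be able to skip getSplits() entirely. The sketch below assumes the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat and its public static getInputPaths(JobContext); InputPathCounter and countConfiguredPaths are hypothetical names introduced for illustration.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/** Hypothetical helper for illustration; not part of Hadoop. */
public final class InputPathCounter {
    private InputPathCounter() {}

    /**
     * Counts the input paths as configured on the job, before any
     * directory listing or pattern expansion takes place.
     */
    public static int countConfiguredPaths(JobContext context) {
        // getInputPaths parses the comma-separated input-directory property
        Path[] configured = FileInputFormat.getInputPaths(context);
        return configured.length;
    }
}
```

From Mapper.setup(Context context) this could be called as InputPathCounter.countConfiguredPaths(context), since the mapper context is a JobContext. The count covers the paths exactly as they were configured, nothing more.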

These paths can still represent directories and glob patterns, so the next thing getSplits() does is pass the array to the protected method org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). That method walks the configured directories and patterns and lists the files in each directory or matching each pattern (also applying a PathFilter if one is configured).

Because listStatus is protected, you could create a simple 'dummy' extension of FileInputFormat with a public listStatus method that accepts the Mapper.Context as its argument and in turn wraps a call to FileInputFormat.listStatus:

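import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DummyFileInputFormat extends FileInputFormat<Object, Object> {
    // Public overload: a Mapper.Context is also a JobContext, so it can be
    // handed straight to the protected listStatus(JobContext)
    public List<FileStatus> listStatus(Mapper.Context mapContext) throws IOException {
        return super.listStatus(mapContext);
    }

    @Override
    public RecordReader<Object, Object> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        // dummy input format, so this will never be called
        return null;
    }
}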

EDIT: In fact, it looks like FileInputFormat already does this for you: at the end of getSplits() it sets the job property mapreduce.input.num.files (at least in 1.0.2; probably introduced in 0.20.203).
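As a rough sketch of how a mapper might pick that property up, assuming it reaches the task configuration as described and that the key is mapreduce.input.num.files (other releases may use a different key, so check the FileInputFormat source for your version). FileCountAwareMapper, readInputFileCount and the key/value types are hypothetical choices made for illustration. Note that the property counts files, not the paths originally configured.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical mapper for illustration. */
public class FileCountAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Key as named in the answer for Hadoop 1.0.2 (assumption for other versions)
    public static final String NUM_INPUT_FILES = "mapreduce.input.num.files";

    private int totalInputFiles;

    /** Returns the recorded file count, or -1 when the property is absent. */
    public static int readInputFileCount(Configuration conf) {
        return conf.getInt(NUM_INPUT_FILES, -1);
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read once per task instead of once per record
        totalInputFiles = readInputFileCount(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Illustrative output only: emit the total alongside each record
        context.write(new Text(String.valueOf(totalInputFiles)), value);
    }
}
```

The -1 default makes a missing property easy to spot, rather than silently formatting values against a count of zero.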

Here's the JIRA ticket




Answer 2:


You can store the number of input paths in your job's configuration yourself, like this:

jobConf.setInt("numberOfPaths", paths.length);

Put this line wherever you configure your job. Then, in your Mapper.setup(Mapper.Context context), read the value back from the configuration you get from the context.
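Both halves of this approach might look roughly like the sketch below, which assumes the new org.apache.hadoop.mapreduce API. PathCountConfig, configureInputs and readPathCount are hypothetical names; only the numberOfPaths key comes from the answer.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/** Hypothetical helper for illustration; not part of Hadoop. */
public final class PathCountConfig {
    public static final String NUMBER_OF_PATHS = "numberOfPaths";

    private PathCountConfig() {}

    /** Driver side: register the inputs and record how many there are. */
    public static void configureInputs(Job job, Path... paths) throws IOException {
        FileInputFormat.setInputPaths(job, paths);
        // Write to the job's own Configuration: a Job keeps a copy of the
        // Configuration it was created with, so later changes to the
        // original object are not seen by the tasks.
        job.getConfiguration().setInt(NUMBER_OF_PATHS, paths.length);
    }

    /** Mapper side: call from setup() with context.getConfiguration(). */
    public static int readPathCount(Configuration conf) {
        // 0 is an arbitrary default for a job that never set the key
        return conf.getInt(NUMBER_OF_PATHS, 0);
    }
}
```

In the mapper, setup() would then call PathCountConfig.readPathCount(context.getConfiguration()) once and keep the result in a field for use in map().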



Source: https://stackoverflow.com/questions/10585560/get-total-input-path-count-in-hadoop-mapper
