Filtering input files using globStatus in MapReduce

Question

I have a lot of input files and I want to process only selected ones, based on the date appended to the end of each file name. I am confused about where to use the globStatus method to filter out the files.

I have a custom RecordReader class and I was trying to use globStatus in its next method but it didn't work out.

public boolean next(Text key, Text value) throws IOException {
    Path filePath = fileSplit.getPath();

    if (!processed) {
        key.set(filePath.getName());

        byte[] contents = new byte[(int) fileSplit.getLength()];
        value.clear();
        FileSystem fs = filePath.getFileSystem(conf);
        fs.globStatus(new Path("/*" + date));
        FSDataInputStream in = null;

        try {
            in = fs.open(filePath);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
    return false;
}

I know it returns a FileStatus array, but how do I use it to filter the files? Can someone please shed some light?

Answer 1:

The globStatus method takes two complementary arguments that let you filter your files. The first is the glob pattern; when a glob pattern is not powerful enough to select specific files, you can also pass a PathFilter as the second argument.

Regarding the glob pattern, the following are supported:

Glob   | Matches
-------------------------------------------------------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter
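
To get a feel for how patterns like these select date-suffixed names, here is a minimal sketch that runs without a cluster. It is an assumption-laden stand-in: it uses the JDK's own glob matcher (FileSystems.getPathMatcher), not Hadoop's globStatus, and the two dialects are not identical. In particular, the JDK negates a character set with [!ab] rather than [^ab], and its * does not cross directory separators. The class name GlobDemo and the file names are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Illustration only: JDK glob matching used as a local stand-in for Hadoop globs.
final class GlobDemo {
    // Returns true if the single file name matches the glob pattern.
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // '*' absorbs the prefix; the date suffix must match literally.
        System.out.println(matches("*2013-01-15", "part-00000-2013-01-15")); // true
        System.out.println(matches("*2013-01-15", "part-00000-2013-01-16")); // false
        // '?' matches exactly one character; [45] matches one character from the set.
        System.out.println(matches("part-?????-2013-01-1[45]", "part-00000-2013-01-14")); // true
        // {a,b} matches either alternative.
        System.out.println(matches("*{2013-01-15,2013-01-16}", "logs-2013-01-16")); // true
    }
}
```

Once a pattern behaves as expected locally, it is still worth checking it against the real filesystem, because of the dialect differences noted above.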

PathFilter is simply an interface like this:

public interface PathFilter {
    boolean accept(Path path);
}

So you can implement this interface and implement the accept method where you can put your logic to filter files.

An example taken from Tom White's excellent book, which defines a PathFilter that excludes paths matching a given regular expression:

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}
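
One detail of the accept method above is easy to miss: String.matches succeeds only when the regular expression matches the entire string, not just a part of it. The sketch below mirrors that accept logic on plain strings so it can be run without Hadoop; the class name RegexExcludeDemo and the sample path are hypothetical.

```java
// Illustration only: the same test as accept(), applied to a plain string.
final class RegexExcludeDemo {
    // true means the path is kept, false means it is filtered out.
    static boolean accept(String path, String regex) {
        return !path.matches(regex);
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode/data/events-2013-01-15";
        // The regex covers only the suffix, so matches() fails and nothing is excluded.
        System.out.println(accept(path, "2013-01-15"));   // true
        // A leading ".*" lets the regex cover the whole path, so the path is excluded.
        System.out.println(accept(path, ".*2013-01-15")); // false
    }
}
```

So a regex passed to this kind of filter has to describe the full path string, usually by starting with ".*".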

You can directly filter your input with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.

EDIT: Since setInputPathFilter takes a class, you can't pass constructor arguments directly, but you can get a similar effect through the Configuration. If your filter also extends Configured, it receives the job's Configuration object, which you will have populated beforehand with the desired values, so you can read those values inside the filter and use them in accept.

For example if you initialize like this:

conf.set("date", "2013-01-15");

Then you can define your filter like this:

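import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            // Accept directories so that their contents are listed and filtered too.
            if (fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {
            // Could not stat the path; fall through and decide on the name alone.
        }
        return path.toString().endsWith(date);
    }

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf);
        // Configured's constructor calls setConf(null), so guard against null.
        if (conf != null) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {
                throw new RuntimeException("Cannot obtain FileSystem", e);
            }
        }
    }
}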

EDIT 2: There were a few issues with the original code; please see the updated class. You also need to remove the constructor, since it is no longer used, and check whether the path is a directory, in which case you should return true so that the contents of the directory can be filtered too.

Answer 2:

For anyone reading this: please don't do anything more complex in the filters than validating the paths. Specifically, don't check whether a file is a directory, look up its size, and so on. Wait until the list/glob operation has returned and filter there, using the information already in the populated FileStatus entries.

Why? All those calls to getFileStatus(), directly or via isDirectory(), are needless calls to the filesystem that add needless namenode load on an HDFS cluster. More critically, against S3 and other object stores each operation potentially makes multiple HTTPS requests, and those really do take measurable time. Even worse, S3 will throttle you if it thinks you are making too many requests across your entire cluster of machines. You don't want that.

Wait until after the call: the file status entries you get back come from the object store's list commands, which usually return thousands of file entries per HTTPS request and so are far more efficient.
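
The shape of that post-listing filter can be sketched as follows. To keep the example runnable without Hadoop on the classpath, it uses a hypothetical Entry class as a stand-in for FileStatus; in real code the same fields would come from FileStatus.getPath(), isDirectory() and getLen() on the array returned by the list/glob call. The class name PostListFilterDemo, the sample paths and the choice to skip empty files are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: filter an already-fetched listing without further filesystem calls.
final class PostListFilterDemo {
    // Hypothetical stand-in for the fields a populated FileStatus already carries.
    static final class Entry {
        final String path;
        final boolean directory;
        final long length;

        Entry(String path, boolean directory, long length) {
            this.path = path;
            this.directory = directory;
            this.length = length;
        }
    }

    // Keeps non-empty regular files whose path ends with the given date.
    static List<String> select(List<Entry> listing, String date) {
        List<String> selected = new ArrayList<>();
        for (Entry e : listing) {
            // Every check reads data returned by the listing; nothing is re-queried.
            if (!e.directory && e.length > 0 && e.path.endsWith(date)) {
                selected.add(e.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> listing = Arrays.asList(
                new Entry("/data/a-2013-01-15", false, 10L),
                new Entry("/data/b-2013-01-16", false, 10L),
                new Entry("/data/dir-2013-01-15", true, 0L));
        System.out.println(select(listing, "2013-01-15")); // [/data/a-2013-01-15]
    }
}
```

The point of the design is that the directory and size checks cost nothing extra here, whereas the same checks inside a PathFilter would trigger one filesystem call per path.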

For further details, inspect the source of org.apache.hadoop.fs.s3a.S3AFileSystem.

Source: https://stackoverflow.com/questions/14332330/filtering-input-files-using-globstatus-in-mapreduce

Tags: java, Hadoop, MapReduce, Cloudera