Hadoop: NullPointerException with Custom InputFormat

Posted by 巧了我就是萌 on 2019-12-13 02:38:12

Question


I've developed a custom InputFormat for Hadoop (including a custom InputSplit and a custom RecordReader) and I'm experiencing a rare NullPointerException.

These classes are going to be used for querying a third-party system that exposes a REST API for record retrieval. Thus, I took inspiration from DBInputFormat, which is a non-HDFS InputFormat as well.

The error I get is the following:

Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

I've searched the source of MapTask (Hadoop 2.1.0) and seen that the problematic part is the initialization of the RecordReader:

472 NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
473       org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
474       TaskReporter reporter,
475       org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
476       throws InterruptedException, IOException {
...
491    this.real = inputFormat.createRecordReader(split, taskContext);
...
494 }
...
519 @Override
520 public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
521       org.apache.hadoop.mapreduce.TaskAttemptContext context
522       ) throws IOException, InterruptedException {
523    long bytesInPrev = getInputBytes(fsStats);
524    real.initialize(split, context);
525    long bytesInCurr = getInputBytes(fsStats);
526    fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
527 }

Of course, the relevant parts of my code:

# MyInputFormat.java

public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
    backend = new Backend(host, port, ssl, APIKey);
}

public static void addResId(Job job, String resId) {
    Configuration conf = job.getConfiguration();
    String inputs = conf.get(INPUT_RES_IDS, "");

    if (inputs.isEmpty()) {
        inputs += resId;
    } else {
        inputs += "," + resId;
    }

    conf.set(INPUT_RES_IDS, inputs);
}

@Override
public List<InputSplit> getSplits(JobContext job) {
    // resulting splits container
    List<InputSplit> splits = new ArrayList<InputSplit>();

    // get the Job configuration
    Configuration conf = job.getConfiguration();

    // get the inputs, i.e. the list of resource IDs
    String input = conf.get(INPUT_RES_IDS, "");
    String[] resIDs = StringUtils.split(input);

    // iterate on the resIDs
    for (String resID: resIDs) {
       splits.addAll(getSplitsResId(resID, job.getConfiguration()));
    }

    // return the splits
    return splits;
}

@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    if (backend == null) {
        logger.info("Unable to create a MyRecordReader, it seems the environment was not properly set");
        return null;
    }

    // create a record reader
    return new MyRecordReader(backend, split, context);
}

# MyRecordReader.java

@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    // get start, end and current positions
    MyInputSplit inputSplit = (MyInputSplit) this.split;
    start = inputSplit.getFirstRecordIndex();
    end = start + inputSplit.getLength();
    current = 0;

    // query the third-party system for the related resource, seeking to the start of the split
    records = backend.getRecords(inputSplit.getResId(), start, end);
}

# MapReduceTest.java

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MapReduceTest(), args);
    System.exit(res);
}

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "MapReduce test");
    job.setJarByClass(MapReduceTest.class);
    job.setMapperClass(MyMap.class);
    job.setCombinerClass(MyReducer.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(MyInputFormat.class);
    MyInputFormat.addResId(job, "ca73a799-9c71-4618-806e-7bd0ca1911f4");
    MyInputFormat.setEnvironmnet(job, "my.host.com", "443", true, "my_api_key");
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    return job.waitForCompletion(true) ? 0 : 1;
}

Any ideas about what is wrong?

By the way, which is the "correct" InputSplit the RecordReader must use: the one given to the constructor, or the one given in the initialize method? Anyway, I've tried both options and the resulting error is the same :)


Answer 1:


The way I read your stack trace, real is null on line 524.

But don't take my word for it. Slip an assert or a System.out.println in there and check the value of real yourself.

NullPointerException almost always means you invoked a method or accessed a field on something you didn't expect to be null. Some libraries and collections will throw it at you as their way of saying "this can't be null".
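To see the failure mode in isolation (a standalone sketch, not Hadoop code; the NpeDemo class and its tiny RecordReader interface are made up for illustration): invoking a method through a null reference fails in exactly the shape the stack trace shows.

```java
public class NpeDemo {
    // Minimal stand-in for Hadoop's RecordReader.
    interface RecordReader {
        void initialize();
    }

    // Returns true if calling initialize() on the given reader throws NPE.
    static boolean throwsNpe(RecordReader real) {
        try {
            real.initialize(); // same shape as "real.initialize(split, context)" at MapTask line 524
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A null reader, i.e. what a createRecordReader() that returned null hands back.
        System.out.println(throwsNpe(null)); // prints true: the call itself throws
    }
}
```

The task JVM does exactly this, minus the try/catch, which is why the job dies with the trace you posted.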

Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)

To me this reads as: in the org.apache.hadoop.mapred package the MapTask class has an inner class NewTrackingRecordReader with an initialize method that threw a NullPointerException at line 524.

524 real.initialize( blah, blah) // I actually stopped reading after the dot

this.real was set on line 491.

491 this.real = inputFormat.createRecordReader(split, taskContext);

Assuming you haven't left out a more closely scoped real that is masking this.real, we need to look at inputFormat.createRecordReader(split, taskContext). If this can return null, it might be the culprit.

Turns out it will return null when backend is null.

@Override
public RecordReader<LongWritable, Text> createRecordReader(
    InputSplit split, 
    TaskAttemptContext context) {

    if (backend == null) {
        logger.info("Unable to create a MyRecordReader, " + 
                    "it seems the environment was not properly set");
        return null;
    }

    // create a record reader
    return new MyRecordReader(backend, split, context);
}

It looks like setEnvironmnet is supposed to set backend:

# MyInputFormat.java

public static void setEnvironmnet(
    Job job, 
    String host, 
    String port, 
    boolean ssl, 
    String APIKey) {

    backend = new Backend(host, port, ssl, APIKey);
}

backend must be declared somewhere outside setEnvironmnet (or you'd be getting a compiler error).

If backend hasn't been set to something non-null upon construction and setEnvironmnet was not called before createRecordReader then you should expect to get exactly the NullPointerException you got.

UPDATE:

As you've noted, since setEnvironmnet() is static, backend must be static as well. This means you must be sure other instances aren't setting it to null.
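The hazard can be shown in a few lines (a standalone sketch; the nested MyInputFormat here is a made-up stand-in for the asker's class, not Hadoop code): a static field belongs to the class, so any code path that clears it, or any JVM in which it was never set, breaks every instance at once.

```java
public class StaticFieldDemo {
    // Stand-in for an InputFormat with a static backend field.
    static class MyInputFormat {
        static Object backend; // shared by ALL instances, set at most once per JVM

        static void setEnvironment(Object b) {
            backend = b;
        }

        // Mirrors createRecordReader: hands back null when backend was never
        // set in this JVM or was cleared by someone else.
        Object createRecordReader() {
            return backend;
        }
    }

    public static void main(String[] args) {
        MyInputFormat.setEnvironment(new Object());
        MyInputFormat a = new MyInputFormat();
        MyInputFormat b = new MyInputFormat();

        MyInputFormat.backend = null;               // "another object" clears the shared field...
        System.out.println(a.createRecordReader()); // ...and every instance now returns null
        System.out.println(b.createRecordReader()); // prints null for this one too
    }
}
```

Note that both instances break even though neither touched the field itself; that is the whole problem with routing per-job state through a static.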




Answer 2:


Solved. The problem is that the backend variable is declared static, i.e. it belongs to the Java class itself, so any other object changing that variable (e.g. setting it to null) affects all other instances of the class. On top of that, in a distributed job a static field set in the client JVM is never shipped to the task JVMs, so backend is still null wherever createRecordReader actually runs.

Now, setEnvironment adds the host, port, SSL usage and the API key to the job configuration (just as addResId already does with the resource ID); when createRecordReader is invoked, this configuration is read and the backend object is created there.

Thanks to CandiedOrange, who put me on the right path!
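A sketch of that fix (assumptions flagged: a plain Map stands in for Hadoop's Configuration, Backend is a stub, and the property keys are invented names): the connection parameters travel inside the job configuration, which Hadoop does serialize to the tasks, and each task rebuilds its own Backend on demand instead of reading a static field.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigBasedBackendDemo {
    // Stub for the third-party client; the real one connects on construction.
    static class Backend {
        final String host;

        Backend(String host, String port, boolean ssl, String apiKey) {
            this.host = host;
        }
    }

    // Hypothetical configuration keys, mirroring INPUT_RES_IDS in the question.
    static final String HOST = "myinputformat.host";
    static final String PORT = "myinputformat.port";
    static final String SSL = "myinputformat.ssl";
    static final String API_KEY = "myinputformat.apikey";

    // Driver side: store the parameters in the job configuration
    // instead of a static field that never leaves the client JVM.
    static void setEnvironment(Map<String, String> conf, String host,
                               String port, boolean ssl, String apiKey) {
        conf.put(HOST, host);
        conf.put(PORT, port);
        conf.put(SSL, Boolean.toString(ssl));
        conf.put(API_KEY, apiKey);
    }

    // Task side: rebuild the Backend from the configuration when the
    // record reader is created, so it is non-null in every JVM.
    static Backend createBackend(Map<String, String> conf) {
        return new Backend(conf.get(HOST), conf.get(PORT),
                Boolean.parseBoolean(conf.get(SSL)), conf.get(API_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setEnvironment(conf, "my.host.com", "443", true, "my_api_key");

        Backend backend = createBackend(conf);
        System.out.println(backend.host); // my.host.com
    }
}
```

In the real code the Map would be job.getConfiguration() in setEnvironment and context.getConfiguration() in createRecordReader; the shape of the fix is the same.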



Source: https://stackoverflow.com/questions/28213382/hadoop-nullpointerexception-with-custom-inputformat
