Question
Here is my UDF:
public DataBag exec(Tuple input) throws IOException {
    Aggregate aggregatedOutput = null;
    int spillCount = 0;
    DataBag outputBag = BagFactory.getInstance().newDefaultBag();
    DataBag values = (DataBag) input.get(0);
    for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
        Tuple tuple = iterator.next();
        //spillCount++;
        ...
        if (some condition regarding current input tuple) {
            //do something to aggregatedOutput with information from input tuple
        } else {
            //Because input tuple does not apply to current aggregatedOutput,
            //return current aggregatedOutput and apply input tuple
            //to a new aggregatedOutput
            Tuple returnTuple = aggregatedOutput.getTuple();
            outputBag.add(returnTuple);
            spillCount++;
            aggregatedOutput = new Aggregate(tuple);
            if (spillCount == 1000) {
                outputBag.spill();
                spillCount = 0;
            }
        }
    }
    return outputBag;
}
Please note that the bag spills to disk once for every 1,000 input tuples. I have set this threshold as low as 50 and as high as 100,000, yet I still receive the memory error:
Pig logfile dump:
Backend error message
---------------------
Error: Java heap space
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
What can I do to solve this? It is processing about a million rows.
HERE IS THE SOLUTION
Using the Accumulator interface:
public class Foo extends EvalFunc<DataBag> implements Accumulator<DataBag> {
    private DataBag outputBag = null;
    private Aggregate currentAggregation = null;

    public void accumulate(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0);
        if (outputBag == null) {
            outputBag = BagFactory.getInstance().newDefaultBag();
        }
        for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
            Tuple tuple = iterator.next();
            ...
            if (some condition regarding current input tuple) {
                //do something to currentAggregation with information from input tuple
            } else {
                //Because input tuple does not apply to currentAggregation,
                //add the current aggregate to the output bag and start
                //a new aggregate from the input tuple
                outputBag.add(currentAggregation.getTuple());
                currentAggregation = new Aggregate(tuple);
            }
        }
    }

    // Called when all tuples from current key have been passed to accumulate
    public DataBag getValue() {
        //Add final current aggregation
        outputBag.add(currentAggregation.getTuple());
        return outputBag;
    }

    // This is called after getValue(); resetting here lets the next key
    // start with a fresh bag and aggregate
    public void cleanup() {
        outputBag = null;
        currentAggregation = null;
    }

    public DataBag exec(Tuple input) throws IOException {
        // Same as above ^^ but this doesn't appear to ever be called.
    }

    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            e.printStackTrace();
            return null;
        }
    }

    class Aggregate {
        ...
        public Tuple getTuple() {
            Tuple output = TupleFactory.getInstance().newTuple(OUTPUT_TUPLE_SIZE);
            try {
                output.set(0, val);
                ...
            } catch (ExecException e) {
                e.printStackTrace();
                return null;
            }
            return output;
        }
        ...
    }
}
Answer 1:
You should increment spillCount every time you append to outputBag, not every time you get a tuple from the iterator. You are only spilling when spillCount is a multiple of 1000 AND your if condition is not met, which may not happen that often (depending on the logic). This may explain why you don't see much difference for different spill thresholds.
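To see how much the counting choice matters, here is a self-contained plain-Java simulation of the two strategies (no Pig dependencies; `tuplesPerAggregate` is a hypothetical stand-in for how often the UDF's else branch fires):

```java
public class SpillCountDemo {
    // Correct strategy: the counter tracks additions to the bag,
    // so a spill fires after every `threshold` appended aggregates.
    static int spillsWhenCountingAdds(int inputs, int tuplesPerAggregate, int threshold) {
        int spillCount = 0, spills = 0;
        for (int i = 1; i <= inputs; i++) {
            if (i % tuplesPerAggregate == 0) {      // aggregate complete -> append to bag
                spillCount++;
                if (spillCount == threshold) { spills++; spillCount = 0; }
            }
        }
        return spills;
    }

    // Buggy strategy: the counter tracks loop iterations, but the
    // equality check still only runs when an aggregate completes.
    static int spillsWhenCountingIterations(int inputs, int tuplesPerAggregate, int threshold) {
        int spillCount = 0, spills = 0;
        for (int i = 1; i <= inputs; i++) {
            spillCount++;                            // incremented every iteration
            if (i % tuplesPerAggregate == 0) {       // check only on the else branch
                if (spillCount == threshold) { spills++; spillCount = 0; }
            }
        }
        return spills;
    }

    public static void main(String[] args) {
        // One million inputs, aggregate boundary every 3 tuples, threshold 1000:
        System.out.println(spillsWhenCountingAdds(1_000_000, 3, 1000));       // prints 333
        System.out.println(spillsWhenCountingIterations(1_000_000, 3, 1000)); // prints 0
    }
}
```

With a boundary every 3 tuples, counting additions spills 333 times over a million inputs, while counting iterations never spills at all: the counter steps past 1000 between two boundary checks, so the equality test never matches.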
If that doesn't solve your problem, I would try extending AccumulatorEvalFunc<DataBag>. In your case you don't actually need access to the whole bag; your implementation fits an accumulator-style implementation because you only need access to the current tuple. This may reduce memory usage. Essentially, you would have an instance variable of type DataBag that accumulates the final output, and another instance variable holding the current aggregate. A call to accumulate() would either 1) update the current aggregate, or 2) add the current aggregate to the output bag and begin a new aggregate. This essentially follows the body of your for loop.
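As a rough illustration, that accumulator pattern can be sketched in plain Java. Pig's DataBag, Tuple, and the UDF's merge condition are replaced with simple stand-ins here — a hypothetical run-length grouping of consecutive equal values — so only the current aggregate is held in working state; real Pig code would implement Accumulator<DataBag> instead:

```java
import java.util.ArrayList;
import java.util.List;

public class AccumulatorSketch {
    // Stand-in for the UDF's Aggregate class: groups consecutive
    // equal values and counts how many times each run repeats.
    static final class Aggregate {
        final int key;
        int count;
        Aggregate(int key) { this.key = key; this.count = 1; }
    }

    private final List<int[]> output = new ArrayList<>();  // stands in for the output DataBag
    private Aggregate current = null;                      // the only per-call working state

    // Mirrors accumulate(): fold one value at a time into the running aggregate.
    void accumulate(int value) {
        if (current != null && current.key == value) {
            current.count++;                               // condition met: extend current aggregate
        } else {
            if (current != null) {                         // emit finished aggregate to the output
                output.add(new int[]{current.key, current.count});
            }
            current = new Aggregate(value);                // begin a new aggregate
        }
    }

    // Mirrors getValue(): flush the final aggregate and return the result.
    List<int[]> getValue() {
        if (current != null) {
            output.add(new int[]{current.key, current.count});
            current = null;
        }
        return output;
    }

    public static void main(String[] args) {
        AccumulatorSketch acc = new AccumulatorSketch();
        for (int v : new int[]{1, 1, 2, 2, 2, 3}) {
            acc.accumulate(v);
        }
        for (int[] pair : acc.getValue()) {
            System.out.println(pair[0] + " x" + pair[1]);  // 1 x2, 2 x3, 3 x1
        }
    }
}
```

The key point is that memory held between calls is one aggregate plus the already-finished output, rather than the entire input bag.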
Source: https://stackoverflow.com/questions/21567307/why-does-this-pig-udf-result-in-an-error-java-heap-space-given-that-i-am-spil