Getting OOM while using GATE on large data set

问题

I am quite new to NLP and am using GATE for it. I am getting OOM Exception if I run my code for large data set(containing 7K+ records). Below is the code where exception occurs.

    /**
 * Run ANNIE
 * 
 * @param controller
 * @throws GateException
 */
public void execute(SerialAnalyserController controller)
        throws GateException {
    TestLogger.info("Running ANNIE...");
    controller.execute();     /**** GateProcessor.java:217 ***/

    // controller.cleanup();
    TestLogger.info("...ANNIE complete");
}

Here is the log :

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.addEntry(Unknown Source)
at java.util.HashMap.put(Unknown Source)
at java.util.HashMap.putAll(Unknown Source)
at gate.annotation.AnnotationSetImpl.<init>(AnnotationSetImpl.java:111)
at gate.jape.SinglePhaseTransducer.attemptAdvance(SinglePhaseTransducer.java:448)
at gate.jape.SinglePhaseTransducer.transduce(SinglePhaseTransducer.java:287)
at gate.jape.MultiPhaseTransducer.transduce(MultiPhaseTransducer.java:168)
at gate.jape.Batch.transduce(Batch.java:352)
at gate.creole.Transducer.execute(Transducer.java:116)
at gate.creole.SerialController.runComponent(SerialController.java:177)
at gate.creole.SerialController.executeImpl(SerialController.java:136)
at gate.creole.SerialAnalyserController.executeImpl(SerialAnalyserController.java:67)
at gate.creole.AbstractController.execute(AbstractController.java:42)
at in.co.test.GateProcessor.execute(GateProcessor.java:217)

I would like to know what exactly is happening with execute function and how it can be resolved. Thanks.

回答1:

Processing large (or many) documents in GATE can require lots of memory, GATE needs lots of space to store annotations. On the other hand various processing resources require lots of memory as well: gazetteers, statistical model-based taggers, etc.

A trick in Gate developer GUI is to store the corpus of documents in a data store, then load only the corpus and run the pipeline. GATE is smart enough to load one document at a time, process it, then save & close it before opening the next one. (You can first store an empty corpus in a data store and then "populate" it from a folder, this will again load documents one by one without wasting memory.)

This is exactly what you should do in your code, open document, process, save and close before opening the next one. If you have a single large document you should split it (in a way that doesn't break your annotation performance).

Here is a code example from the "Advanced GATE Embedded" module:

// for each piece of text:

Document doc = (Document)Factory.createResource("gate.corpora.DocumentImpl",
              Utils.featureMap("stringContent", text, "mimeType", mime));
Corpus corpus = Factory.newCorpus("webapp corpus");
try {
  corpus.add(doc);
  application.execute();
  ...
finally {
  corpus.clear();
  Factory.deleteResource(doc);
}

来源：https://stackoverflow.com/questions/15082298/getting-oom-while-using-gate-on-large-data-set

标签

java

nlp

gate