Stanford NLP - OpenIE out of memory when processing list of files

Submitted by 余生颓废 on 2019-12-06 00:49:24

From the comments above: I suspect this is an issue with too much parallelism and too little memory. OpenIE is a bit memory-hungry, especially with long sentences, so running many files in parallel can consume a fair bit of memory.

An easy fix is to force the program to run single-threaded by setting the -threads 1 flag. If possible, increasing the memory available to the JVM should help as well.
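For example, a single-threaded run with a larger heap might look like the following. This is a sketch, not a verified invocation: the 8g heap size and the input.txt filename are illustrative placeholders; the jar path and annotator list match the command further below.

```shell
# Single-threaded CoreNLP run with an increased heap.
# -Xmx8g is illustrative: use as much memory as your machine allows.
java -Xmx8g -cp "stanford-corenlp-full-2015-12-09/*" \
  edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie \
  -threads 1 \
  -file input.txt -outputFormat text
```

Note that -Xmx is a JVM option and must come before the class name, while -threads 1 is a CoreNLP property and comes after it.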

Run this command to get a separate annotation per file (sample-file-list.txt should contain one file path per line):

java -Xmx4g -cp "stanford-corenlp-full-2015-12-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -filelist sample-file-list.txt -outputDirectory output_dir -outputFormat text
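If you need to build sample-file-list.txt programmatically, a minimal sketch follows. The input_docs directory name is a hypothetical placeholder for wherever your corpus lives; the command simply lists each .txt file on its own line, which is the format -filelist expects.

```shell
# Write one file path per line to sample-file-list.txt.
# "input_docs" is a hypothetical directory; point it at your corpus.
ls input_docs/*.txt > sample-file-list.txt
cat sample-file-list.txt
```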