Stanford NLP - OpenIE out of memory when processing list of files

Submitted by 十年热恋 on 2019-12-07 14:42:15

Question


I'm trying to extract information from several files using the OpenIE tool from Stanford CoreNLP. It throws an out-of-memory error when several files are passed as input, but not when only one file is passed.

All files have been queued; awaiting termination...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.stanford.nlp.graph.DirectedMultiGraph.outgoingEdgeIterator(DirectedMultiGraph.java:508)
at edu.stanford.nlp.semgraph.SemanticGraph.outgoingEdgeIterator(SemanticGraph.java:165)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.advance(GraphRelation.java:267)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1102)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1083)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.<init>(GraphRelation.java:257)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER.searchNodeIterator(GraphRelation.java:257)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:320)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.matches(CoordinationPattern.java:211)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matchChild(NodePattern.java:514)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:542)
at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segmentVerb(RelationTripleSegmenter.java:541)
at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segment(RelationTripleSegmenter.java:850)
at edu.stanford.nlp.naturalli.OpenIE.relationInFragment(OpenIE.java:354)
at edu.stanford.nlp.naturalli.OpenIE.lambda$relationsInFragments$2(OpenIE.java:366)
at edu.stanford.nlp.naturalli.OpenIE$$Lambda$76/1438896944.apply(Unknown Source)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1540)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at edu.stanford.nlp.naturalli.OpenIE.relationsInFragments(OpenIE.java:366)
at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:486)
at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$3(OpenIE.java:554)
at edu.stanford.nlp.naturalli.OpenIE$$Lambda$25/606198361.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:554)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.naturalli.OpenIE.processDocument(OpenIE.java:630)
DONE processing files. 1 exceptions encountered.

I pass the files as input using this call:

java -mx3g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE file1 file2 file3 etc.

I tried increasing the memory with -mx3g and other variants, and although the number of processed files increases, it doesn't increase by much (from 5 to 7 files, for example). Each file is processed correctly on its own, so I'm ruling out a single file with very long sentences or too many lines as the cause.

Is there an option I'm not considering, some OpenIE or Java flag, that I can use to force a flush to output, a cleanup, or a garbage collection between each file that is processed?

Thank you in advance


Answer 1:


From the comments above: I suspect this is an issue with too much parallelism and too little memory. OpenIE is a bit memory hungry, especially with long sentences, and so running many files in parallel can take up a fair bit of memory.

An easy fix is to force the program to run single-threaded by setting the -threads 1 flag. If possible, increasing the available memory should help as well.
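For example, the call from the question could be adapted as follows (the 4g heap size is only an illustrative value; -threads 1 forces single-threaded processing):

java -mx4g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar edu.stanford.nlp.naturalli.OpenIE -threads 1 file1 file2 file3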




Answer 2:


Run this command to get a separate annotation output per file (sample-file-list.txt should contain one input file path per line):

java -Xmx4g -cp "stanford-corenlp-full-2015-12-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -filelist sample-file-list.txt -outputDirectory output_dir -outputFormat text
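If you prefer to drive the pipeline from Java rather than the command line, a minimal sketch along these lines should behave equivalently: one pipeline is built once, each file is annotated in its own pass, and the per-file annotation can be garbage-collected before the next file is read. The class name OpenIEPerFile and the tab-separated output format are just illustrative; the annotator list matches the command above.

import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.Properties;

public class OpenIEPerFile {
  public static void main(String[] args) throws Exception {
    // Same annotators as the command-line call above.
    Properties props = new Properties();
    props.setProperty("annotators",
        "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
    // Build the pipeline once and reuse it for every file.
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // sample-file-list.txt: one input file path per line.
    for (String path : Files.readAllLines(Paths.get("sample-file-list.txt"))) {
      String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
      Annotation doc = new Annotation(text);
      pipeline.annotate(doc);
      for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
        Collection<RelationTriple> triples =
            sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
        for (RelationTriple t : triples) {
          System.out.println(path + "\t" + t.confidence + "\t"
              + t.subjectLemmaGloss() + "\t"
              + t.relationLemmaGloss() + "\t"
              + t.objectLemmaGloss());
        }
      }
      // doc goes out of scope at the end of each iteration, so its annotations
      // become eligible for garbage collection before the next file is loaded.
    }
  }
}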


Source: https://stackoverflow.com/questions/36431900/stanford-nlp-openie-out-of-memory-when-processing-list-of-files
