Stanford CoreNLP: Use partial existing annotation

We are trying to reuse our existing

tokenization
sentence splitting
and named entity tagging

while using Stanford CoreNLP to additionally provide us with

part-of-speech tagging
lemmatization
and parsing

Currently, we are trying it the following way:

1) make an annotator for "pos, lemma, parse"

Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);

2) read in the sentences, with a custom method:

List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));

within that method, the tokens are constructed the following way:

CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);

and they are combined into sentences like this:

Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());

3) the list of sentences is passed to the pipeline:

Annotation document = new Annotation(sentences);
pipeline.annotate(document);

However, when running this, we get the following error:

null: InvocationTargetException: annotator "pos" requires annotator "tokenize"

Any pointers on how we can achieve this?

The exception is thrown because a requirement of the "pos" annotator (an instance of the POSTaggerAnnotator class) is not satisfied.

Requirements for the annotators that StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator, two requirements are defined:

tokenize
ssplit

Both of these requirements need to be satisfied, which means that both the "tokenize" and the "ssplit" annotator must be specified in the annotators list before the "pos" annotator.
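To make the ordering rule concrete, here is a minimal sketch of such a check in plain Java. It is a simplified model, not CoreNLP's actual implementation: the class `RequirementCheckSketch`, the method `firstUnsatisfied`, and the `REQUIRES` table are hypothetical names, and only the "pos" entry is taken from the requirements listed above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a requirements check; not CoreNLP source code.
class RequirementCheckSketch {

    // Requirements per annotator. Only the "pos" entry comes from the text above;
    // annotators without an entry are treated as having no requirements.
    static final Map<String, List<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
    }

    // Walks the comma-separated annotators list in order and returns a message
    // for the first requirement that was not listed earlier, or null if none.
    static String firstUnsatisfied(String annotators) {
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            List<String> needed = REQUIRES.getOrDefault(name, Collections.<String>emptyList());
            for (String requirement : needed) {
                if (!seen.contains(requirement)) {
                    return "annotator \"" + name + "\" requires annotator \"" + requirement + "\"";
                }
            }
            seen.add(name);
        }
        return null;
    }

    public static void main(String[] args) {
        // The list from the question: "pos" comes first, so "tokenize" is missing.
        System.out.println(firstUnsatisfied("pos, lemma, parse"));
        // Both requirements appear before "pos", so nothing is reported.
        System.out.println(firstUnsatisfied("tokenize, ssplit, pos, lemma, parse"));
    }
}
```

In this model the annotators list from the question fails on its very first entry, which corresponds to the message in the reported exception.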

Now back to the question. If you would like to skip the "tokenize" and "ssplit" annotators in your pipeline, you need to disable the requirements check that is performed during initialization of the pipeline. I found two equivalent ways to do this:

1) Disable requirements enforcement in the properties object passed to the StanfordCoreNLP constructor:

props.setProperty("enforceRequirements", "false");

2) Set the enforceRequirements parameter of the StanfordCoreNLP constructor to false:

StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
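The first option, applied to the configuration from the question, can be sketched with nothing but `java.util.Properties`. The class and method names (`PipelinePropsSketch`, `buildProps`) are hypothetical; the property keys and values are the ones that appear in the question and in this answer. Creating the pipeline itself needs CoreNLP on the classpath, so that call is shown only as a comment.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys and values come from the thread.
class PipelinePropsSketch {

    // Builds the configuration for a pipeline that omits "tokenize" and "ssplit".
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "pos, lemma, parse");
        props.setProperty("parse.maxlen", "80");
        props.setProperty("pos.maxlen", "80");
        // Turn off the requirements check performed at pipeline initialization.
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.getProperty("enforceRequirements"));
        // With CoreNLP available, the pipeline would then be created as in the question:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Disabling the check only removes the validation at startup. The annotation you pass in presumably still has to carry the tokens and sentences that the remaining annotators read.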

You should add "tokenize" to the annotators list:

pipelineProps.put("annotators", "tokenize, pos, lemma, parse");

Source: https://stackoverflow.com/questions/26245422/stanford-corenlp-use-partial-existing-annotation

Tags

nlp

stanford-nlp