Stanford CoreNLP: Use partial existing annotation


Question


We are trying to use existing

  • tokenization
  • sentence splitting
  • and named entity tagging

while we would like Stanford CoreNLP to additionally provide us with

  • part-of-speech tagging
  • lemmatization
  • and parsing

Currently, we are trying the following approach:

1) Create a pipeline with the annotators "pos, lemma, parse":

Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);

2) Read in the sentences with a custom method:

List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));

Within that method, the tokens are constructed as follows:

CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);

They are combined into sentences like this:

Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());

3) The list of sentences is passed to the pipeline:

  Annotation document = new Annotation(sentences);
  pipeline.annotate(document);

However, when running this, we get the following error:

null: InvocationTargetException: annotator "pos" requires annotator "tokenize"

Any pointers on how we can achieve this?


Answer 1:


The exception is thrown because a requirement of the "pos" annotator (an instance of the POSTaggerAnnotator class) is not satisfied.

The requirements for the annotators that StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator, two requirements are defined:

  • tokenize
  • ssplit

Both of these requirements need to be satisfied, which means that both the "tokenize" annotator and the "ssplit" annotator must be specified in the annotators list before the "pos" annotator.
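
For comparison, a pipeline that satisfies these requirements in the usual way simply lists the prerequisite annotators first (this is the standard full pipeline, not the partial setup asked about):

Properties props = new Properties();
// "tokenize" and "ssplit" come before "pos", so the requirement check passes,
// but the pipeline will then redo tokenization and sentence splitting itself.
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);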

Now back to the question... If you would like to skip the "tokenize" and "ssplit" annotations in your pipeline, you need to disable the requirements check that is performed during initialization of the pipeline. I found two equivalent ways to do this (a combined sketch follows the list):

  • Disable requirements enforcement in the properties object passed to the StanfordCoreNLP constructor:

    props.setProperty("enforceRequirements", "false");

  • Set the enforceRequirements parameter of the StanfordCoreNLP constructor to false:

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
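
Putting the two pieces together, here is a minimal, self-contained sketch of the setup described in the question combined with the second option. The class name, the example sentence and its NE tags are made up for illustration, and the imports assume a CoreNLP 3.x distribution on the classpath:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PartialAnnotationSketch {                      // hypothetical class name
    public static void main(String[] args) {
        // Pipeline that starts at POS tagging; tokenization, sentence splitting
        // and NE tags are supplied by us, so requirement checking is disabled.
        Properties props = new Properties();
        props.setProperty("annotators", "pos, lemma, parse");
        props.setProperty("parse.maxlen", "80");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false); // enforceRequirements = false

        // Build one pre-tokenized sentence, as in the question.
        String[] words = {"Stanford", "is", "in", "California", "."};   // made-up example
        String[] neTags = {"ORGANIZATION", "O", "O", "LOCATION", "O"};  // pre-existing NE tags
        List<CoreLabel> tokens = new ArrayList<CoreLabel>();
        for (int i = 0; i < words.length; i++) {
            CoreLabel token = new CoreLabel();
            token.setValue(words[i]);
            token.setWord(words[i]);
            token.setOriginalText(words[i]);
            token.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTags[i]);
            tokens.add(token);
        }

        Annotation sentence = new Annotation("Stanford is in California .");
        sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
        sentence.set(CoreAnnotations.TokenBeginAnnotation.class, 0);
        sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokens.size());
        sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, 0);

        List<CoreMap> sentences = new ArrayList<CoreMap>();
        sentences.add(sentence);

        // Wrap the sentences in a document annotation and run the partial pipeline.
        Annotation document = new Annotation(sentences);
        pipeline.annotate(document);

        // The tokens should now carry POS tags and lemmas added by the pipeline.
        for (CoreLabel token : tokens) {
            System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.PartOfSpeechAnnotation.class) + "\t"
                    + token.get(CoreAnnotations.LemmaAnnotation.class));
        }
    }
}

If the parser complains about missing character offsets in your CoreNLP version, CoreAnnotations.CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation can be set on each token in the same way as the NE tag.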




Answer 2:


You should add the "tokenize" annotator to the parameters:

pipelineProps.put("annotators", "tokenize, pos, lemma, parse");


Source: https://stackoverflow.com/questions/26245422/stanford-corenlp-use-partial-existing-annotation
