How can I detect named entities that have more than 1 word using CoreNLP's RegexNER?

I am using the RegexNER annotator in CoreNLP and some of my named entities consist of multiple words. Excerpt from my mapping file:

RAF inhibitor DRUG_CLASS

Gilbert's syndrome DISEASE

The first one gets detected but each word gets the annotation DRUG_CLASS and there seems to be no way to link the words, like an NER id which both words would have.

The second case does not get detected at all and that's probably because the tokenizer treats the apostrophe after Gilbert as a separate token. Since RegexNER has the tokenization as a dependency, I can't really get around it.

Any suggestions to resolve these cases?

If you use the entitymentions annotator that will create entity mentions out of consecutive tokens with the same ner tags. There is the downside that if two entities of the same type are side by side they will be joined together. We are working on improving the ner system so we may include a new model that finds the boundaries of distinct mentions in these cases, hopefully this will go into Stanford CoreNLP 3.8.0.

Here is some sample code for accessing the entity mentions:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visted Los Angeles on Tuesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    for (CoreMap entityMention : document.get(CoreAnnotations.MentionsAnnotation.class)) {
      System.out.println(entityMention);
      System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class));
    }
  }
}

If you just have your rules tokenized the same way as the tokenizer it will work fine, so for instance the rule should be Gilbert 's syndrome.

So you could just run the tokenizer on all your text patterns and this problem will go away.

来源：https://stackoverflow.com/questions/43100750/how-can-i-detect-named-entities-that-have-more-than-1-word-using-corenlps-regex

标签

stanford-nlp