Formatting NER output from Stanford CoreNLP

Question

I am using Stanford CoreNLP for NER. When I extract organization names, each word is tagged separately. So if the entity is "NEW YORK TIMES", it is recorded as three different entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that the entity comes out as one combined string?

With the Stanford NER command-line utility, we can choose the output format, for example inlineXML. Can we set a property to select the output format in Stanford CoreNLP in the same way?

Answer 1:

If you just want the complete strings of each named entity found by Stanford NER, try this:

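import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;

String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    // second and third are the begin and end character offsets of the entity
    System.out.println(text.substring(entity.second, entity.third) + " @ " + entity.second);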

In case you're wondering, the entity class is indicated by entity.first.

Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
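
If you go the inline XML route and still need the entities as a list, the tagged string can be post-processed. The following is a minimal sketch using only the Java standard library; the class name InlineXmlEntities and the assumption that tag names consist of upper-case letters and underscores are mine, not part of the CoreNLP API. It also assumes the input text itself contains no angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls "LABEL: text" pairs out of inline-XML NER output.
final class InlineXmlEntities {

    // Matches <LABEL>some text</LABEL>; the back-reference \1 forces the
    // closing tag to carry the same label as the opening tag.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Z_]+)>(.*?)</\\1>", Pattern.DOTALL);

    public static List<String> extract(String tagged) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(tagged);
        while (m.find()) {
            // group(1) is the entity class, group(2) the full entity string
            entities.add(m.group(1) + ": " + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        extract(tagged).forEach(System.out::println);
    }
}
```

In real use, the string passed to extract would be the return value of ner.classifyWithInlineXML(text).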

Answer 2:

No, CoreNLP 3.5.0 has no utility to merge the NER labels. The next release (coming sometime next week) has a new MentionsAnnotator which handles this merging for you. For now, you can (a) use the MentionsAnnotator, available on the CoreNLP master branch, or (b) merge manually.

Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)
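
Option (b), merging manually, could look roughly like the sketch below. It is deliberately independent of CoreNLP: it takes parallel lists of words and NER tags (which you would fill from each token's word() and ner() values) and joins runs of adjacent tokens that share the same non-"O" tag. The class name NerMerger and the "LABEL: text" output format are made up for illustration. Note one limitation of this simple approach: two distinct entities of the same class that sit directly next to each other are merged into one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: merges adjacent tokens that carry the same NER tag.
final class NerMerger {

    // Tag used for tokens that are not part of any entity.
    private static final String OUTSIDE = "O";

    public static List<String> merge(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = OUTSIDE;

        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed: close the entity that was open, if any.
                if (!currentTag.equals(OUTSIDE)) {
                    entities.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals(OUTSIDE)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Close an entity that runs to the end of the input.
        if (!currentTag.equals(OUTSIDE)) {
            entities.add(currentTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> words = List.of("The", "New", "York", "Times", "is", "in", "New", "York");
        List<String> tags = List.of("O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION",
                "O", "O", "LOCATION", "LOCATION");
        merge(words, tags).forEach(System.out::println);
    }
}
```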

Answer 3:

You can set any property in the properties file, including the "outputFormat" property. Stanford CoreNLP supports several output formats, such as json, xml, and text. However, the xml option is not an inlineXML format: it gives per-token annotations for NER, as shown below.

    <tokens> 
      <token id="1"> 
        <word>New</word> 
        <lemma>New</lemma> 
        <CharacterOffsetBegin>0</CharacterOffsetBegin> 
        <CharacterOffsetEnd>3</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="2"> 
        <word>York</word> 
        <lemma>York</lemma> 
        <CharacterOffsetBegin>4</CharacterOffsetBegin> 
        <CharacterOffsetEnd>8</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="3"> 
        <word>Times</word> 
        <lemma>Times</lemma> 
        <CharacterOffsetBegin>9</CharacterOffsetBegin> 
        <CharacterOffsetEnd>14</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
    </tokens>
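
Because each token in this format carries its own NER tag and character offsets, the per-token XML can be collapsed into entity spans after the fact. The sketch below does this with the JDK's built-in DOM parser. The class name XmlNerSpans and the "LABEL [begin,end)" output format are illustrative choices, and the code assumes every token element contains NER, CharacterOffsetBegin and CharacterOffsetEnd children as in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: turns per-token XML into merged entity character spans.
final class XmlNerSpans {

    public static List<String> spans(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList tokens = doc.getElementsByTagName("token");
        List<String> result = new ArrayList<>();
        String openTag = "O";   // "O" marks tokens outside any entity
        int begin = -1;
        int end = -1;
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String tag = child(token, "NER");
            if (!tag.equals(openTag)) {
                // Tag changed: emit the span that was open, then start a new one.
                if (!openTag.equals("O")) {
                    result.add(openTag + " [" + begin + "," + end + ")");
                }
                openTag = tag;
                begin = Integer.parseInt(child(token, "CharacterOffsetBegin"));
            }
            end = Integer.parseInt(child(token, "CharacterOffsetEnd"));
        }
        if (!openTag.equals("O")) {
            result.add(openTag + " [" + begin + "," + end + ")");
        }
        return result;
    }

    // Text content of the first child element with the given name.
    private static String child(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }
}
```

With the begin and end offsets in hand, the full entity string can be recovered from the original input via text.substring(begin, end).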

Answer 4:

From Stanford CoreNLP 3.6 onwards, you can add the entitymentions annotator to the pipeline and get a list of all entity mentions. Here is a working example:

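import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.MentionsAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
// entitymentions runs last so that it can merge the tags set by ner and regexner
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
// Mapping file for custom entity classes; replace with your own file
props.setProperty("regexner.mapping", "jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);

pipeline.annotate(annotation);

// Each CoreMap is one merged, possibly multi-word, entity mention
List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
    String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
    System.out.println(multiWord + " : " + custNERClass);
}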

Source: https://stackoverflow.com/questions/27852400/formatting-ner-output-from-stanford-corenlp

Tags

stanford-nlp