Formatting NER output from Stanford CoreNLP

谁都会走 posted on 2019-12-01 11:55:35

If you just want the complete strings of each named entity found by Stanford NER, try this:

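import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third)); // second/third are the begin/end character offsets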

In case you're wondering, the entity class is indicated by entity.first.

Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
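If you go the inline XML route and still need the entities as a list, a regular expression is usually enough to pull them back out. The following is only a minimal sketch using the standard library: the class name InlineXmlEntities and the "LABEL: text" output format are made up for illustration, and it assumes labels are upper-case names and that the input text itself contains no angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP.
class InlineXmlEntities {
    // Matches <LABEL>text</LABEL>; the backreference \1 forces the closing
    // tag to carry the same label as the opening tag.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns one "LABEL: text" string per tagged span, in document order. */
    static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(m.group(1) + ": " + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String entity : extract(tagged)) {
            System.out.println(entity);
        }
    }
}
```

In real use, the string passed to extract would be the return value of ner.classifyWithInlineXML(text).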

No, CoreNLP 3.5.0 has no utility to merge the NER labels. The next release (coming sometime next week) has a new MentionsAnnotator which handles this merging for you. For now, you can (a) use the MentionsAnnotator, available on the CoreNLP master branch, or (b) merge manually.
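For option (b), merging manually mostly comes down to joining consecutive tokens that share the same non-"O" label. Here is a rough sketch in plain Java; the class NerMerger and its parallel-list input are assumptions for illustration, so in practice you would fill the two lists from the token text and NER tag of each CoreLabel. Note that this simple rule also glues together two distinct entities of the same type when they are directly adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class NerMerger {
    /** Joins consecutive tokens sharing a non-"O" label into "LABEL: text" strings. */
    static List<String> merge(List<String> words, List<String> tags) {
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O"; // "O" marks tokens outside any entity
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The label changed, so close the mention that was open (if any).
                if (!currentTag.equals("O")) {
                    mentions.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Flush a mention that runs to the end of the input.
        if (!currentTag.equals("O")) {
            mentions.add(currentTag + ": " + current);
        }
        return mentions;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("The", "New", "York", "Times", "is", "in", "Paris");
        List<String> tags = Arrays.asList("O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O", "O", "LOCATION");
        for (String mention : merge(words, tags)) {
            System.out.println(mention);
        }
    }
}
```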

Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)

You can set any property in the properties file, including the "outputFormat" property. Stanford CoreNLP supports several output formats, such as json, xml, and text. However, the xml option is not an inline XML format: it gives per-token annotations for NER, as in this example:

    <tokens> 
      <token id="1"> 
        <word>New</word> 
        <lemma>New</lemma> 
        <CharacterOffsetBegin>0</CharacterOffsetBegin> 
        <CharacterOffsetEnd>3</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="2"> 
        <word>York</word> 
        <lemma>York</lemma> 
        <CharacterOffsetBegin>4</CharacterOffsetBegin> 
        <CharacterOffsetEnd>8</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="3"> 
        <word>Times</word> 
        <lemma>Times</lemma> 
        <CharacterOffsetBegin>9</CharacterOffsetBegin> 
        <CharacterOffsetEnd>14</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
    </tokens> 
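If you need to post-process this per-token XML yourself, the JDK's built-in DOM parser can read the word and NER label of each token. This is only a sketch: the class TokenXmlReader is hypothetical, and it assumes every <token> element has exactly the <word> and <NER> children shown above. Grouping consecutive tokens that share a label then gives you the full entity ("New York Times").

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of CoreNLP.
class TokenXmlReader {
    /** Returns a {word, NER label} pair for every <token> element in the XML. */
    static List<String[]> readTokens(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                // Assumes each token has one <word> and one <NER> child.
                String word = token.getElementsByTagName("word").item(0).getTextContent();
                String ner = token.getElementsByTagName("NER").item(0).getTextContent();
                pairs.add(new String[] {word, ner});
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse token XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tokens>"
                + "<token id=\"1\"><word>New</word><NER>ORGANIZATION</NER></token>"
                + "<token id=\"2\"><word>York</word><NER>ORGANIZATION</NER></token>"
                + "<token id=\"3\"><word>Times</word><NER>ORGANIZATION</NER></token>"
                + "</tokens>";
        for (String[] pair : readTokens(xml)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```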

From Stanford CoreNLP 3.6 onwards, you can add the entitymentions annotator to the pipeline and get a list of all entity mentions, with multi-token entities already merged. The following example works:

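import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.MentionsAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// entitymentions must come after ner (and regexner, if used) in the annotator list
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
// jg-regexner.txt is a custom RegexNER mapping file; replace it with your own
props.setProperty("regexner.mapping", "jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);

pipeline.annotate(annotation);

// Each CoreMap is one complete entity mention, possibly spanning several tokens
List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
    String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
    System.out.println(multiWord + " : " + custNERClass);
}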