Question
I am trying the newest version of Stanford CoreNLP. When I extract location or organisation names, every single word is tagged separately. So if the entity is "NEW YORK TIMES", it gets recorded as three different entities: "NEW", "YORK" and "TIMES". I see that the newest CoreNLP has an "entitymentions" annotator, which I think may solve this problem. However, there are no usage instructions or examples for it. Could anyone give me more information about this new feature?
Answer 1:
Take a look at the mentions annotation key (CoreAnnotations.MentionsAnnotation.class). It should be attached to each sentence and contain a list of CoreMaps, one per mention. So there should be a CoreMap in there that corresponds to the mention of "New York Times".
Answer 2:
I guess no annotator will annotate NEW YORK TIMES as a single entity unless you train the model on a dataset that contains it.
The Stanford NER and POS taggers are trained on particular datasets, and they annotate entities based on that training. (I noticed that the Stanford source includes dictionary lists of people, locations and organizations, which presumably help decide which entities get annotated.)
A trained model can annotate Newyork as an entity; if you want NEW YORK TIMES annotated as an entity, you have to train with datasets that contain it.
I tested with these annotators: tokenize,ssplit,pos,lemma,ner,parse,dcoref.
Query: New York Times is really nice.
Result : [Text=New CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=New NamedEntityTag=ORGANIZATION] [Text=York CharacterOffsetBegin=4 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=York NamedEntityTag=ORGANIZATION] [Text=Times CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=NNP Lemma=Times NamedEntityTag=ORGANIZATION] [Text=is CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VBZ Lemma=be NamedEntityTag=O] [Text=really CharacterOffsetBegin=18 CharacterOffsetEnd=24 PartOfSpeech=RB Lemma=really NamedEntityTag=O] [Text=nice CharacterOffsetBegin=25 CharacterOffsetEnd=29 PartOfSpeech=JJ Lemma=nice NamedEntityTag=O] [Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Query: Newyork times
Result : [Text=Newyork CharacterOffsetBegin=0 CharacterOffsetEnd=7 PartOfSpeech=NNP Lemma=Newyork NamedEntityTag=LOCATION] [Text=times CharacterOffsetBegin=8 CharacterOffsetEnd=13 PartOfSpeech=NNS Lemma=time NamedEntityTag=O]
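In the first result above, New, York and Times each carry NamedEntityTag=ORGANIZATION. As a rough illustration only, adjacent tokens that share the same non-O tag could be joined by hand into one mention. The sketch below is plain Java and does not call CoreNLP; the class name AdjacentTagMerger and the method merge are hypothetical, and this is not claimed to be how the entitymentions annotator is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: joins runs of adjacent tokens
// that share the same NER tag (anything other than "O") into one mention.
final class AdjacentTagMerger {

    static List<String> merge(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            boolean continuesRun = !tag.equals("O") && tag.equals(currentTag);
            // A tag change (or an "O" token) closes the run collected so far.
            if (!continuesRun && current.length() > 0) {
                mentions.add(current + "/" + currentTag);
                current.setLength(0);
            }
            // Tokens tagged "O" are never part of a mention.
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words[i]);
            }
            currentTag = tag;
        }
        // Close a run that extends to the end of the sentence.
        if (current.length() > 0) {
            mentions.add(current + "/" + currentTag);
        }
        return mentions;
    }

    public static void main(String[] args) {
        // Words and tags copied from the first result above.
        String[] words = {"New", "York", "Times", "is", "really", "nice", "."};
        String[] tags = {"ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O", "O", "O", "O"};
        System.out.println(merge(words, tags));
    }
}
```

One limitation of this simple approach: two distinct entities of the same type that sit directly next to each other would be merged into a single mention.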
Answer 3:
Try
Integer entityMentionIndex = coreLabel.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
If you try it with the string "New York Times newspaper is distributed in California", you can see that entityMentionIndex is 0 (zero) for each of the words New, York and Times. Tokens that share the same index belong to a single entity.
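To make the grouping concrete, here is a minimal sketch in plain Java that collects tokens by their mention index. It does not call CoreNLP: the index values are supplied by hand, the class name MentionIndexGrouper is hypothetical, and it assumes that tokens outside any mention have no index (modeled as null) and that "California" would get index 1, neither of which is stated in the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of CoreNLP: groups tokens by the value that
// EntityMentionIndexAnnotation would hold for each of them.
final class MentionIndexGrouper {

    static Map<Integer, String> group(String[] words, Integer[] mentionIndexes) {
        if (words.length != mentionIndexes.length) {
            throw new IllegalArgumentException("words and indexes must have the same length");
        }
        // LinkedHashMap keeps mentions in order of first appearance.
        Map<Integer, String> mentions = new LinkedHashMap<>();
        for (int i = 0; i < words.length; i++) {
            Integer index = mentionIndexes[i];
            if (index == null) {
                continue; // assumption: token is not part of any mention
            }
            // Append the word to the text already collected for this index.
            mentions.merge(index, words[i], (soFar, next) -> soFar + " " + next);
        }
        return mentions;
    }

    public static void main(String[] args) {
        String[] words = {"New", "York", "Times", "newspaper", "is",
                          "distributed", "in", "California"};
        // Index 0 for New, York, Times comes from the answer; the rest is assumed.
        Integer[] indexes = {0, 0, 0, null, null, null, null, 1};
        System.out.println(group(words, indexes));
    }
}
```

With a real pipeline, the index array would be filled from coreLabel.get(CoreAnnotations.EntityMentionIndexAnnotation.class) for each token instead of being written out by hand.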
Source: https://stackoverflow.com/questions/29667479/how-to-use-entitymentions-annotator-in-stanford-corenlp