How to NER and POS tag a pre-tokenized text with Stanford CoreNLP?

Submitted by 半城伤御伤魂 on 2019-12-03 16:43:49

If you set the property:

tokenize.whitespace = true

then the CoreNLP pipeline will tokenize on whitespace rather than the default PTB tokenization. You may also want to set:

ssplit.eolonly = true

so that you only split sentences on newline characters.
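
As a rough sketch of how these two properties might be used from Java: the class and method names below (PretokenizedInput, pretokenizedProps, toText) are made up for illustration, and the annotator list is an assumption about what a POS + NER pipeline would need. The code itself only uses the JDK; it builds the Properties object and lays out pre-tokenized sentences as one sentence per line with tokens separated by single spaces, which is the input shape the two settings expect.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

class PretokenizedInput {

    // Properties asking CoreNLP to keep the caller's tokens and line breaks.
    // The annotator list is an assumption for a POS + NER pipeline.
    static Properties pretokenizedProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("tokenize.whitespace", "true");
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    // One sentence per line, tokens separated by single spaces.
    static String toText(List<List<String>> sentences) {
        return sentences.stream()
                .map(tokens -> String.join(" ", tokens))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("John", "met", "Amy", "in", "Los", "Angeles"),
                Arrays.asList("They", "left", "early"));
        System.out.println(toText(sentences));
        // With CoreNLP and its models on the classpath, the next step would be:
        //   new StanfordCoreNLP(pretokenizedProps()).process(toText(sentences));
    }
}
```

This only works if no token contains whitespace or a newline itself; otherwise the token boundaries would shift when CoreNLP splits the text again.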

To run a classifier programmatically over a list of tokens that you have already obtained by some other means, without a kludge like joining them with whitespace and then tokenizing again, you can use the Sentence.toCoreLabelList method:

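import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.Sentence;  // named SentenceUtils in newer CoreNLP releases

// classifier: an already-loaded AbstractSequenceClassifier<CoreLabel>, e.g. a CRFClassifier
String[] tokenStrs = {"John", "met", "Amy", "in", "Los", "Angeles"};
List<CoreLabel> tokens = Sentence.toCoreLabelList(tokenStrs);
for (CoreLabel cl : classifier.classifySentence(tokens)) {
  System.out.println(cl.toShorterString());
}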

Output:

[Value=John Text=John Position=0 Answer=PERSON Shape=Xxxx DistSim=463]
[Value=met Text=met Position=1 Answer=O Shape=xxxk DistSim=476]
[Value=Amy Text=Amy Position=2 Answer=PERSON Shape=Xxx DistSim=396]
[Value=in Text=in Position=3 Answer=O Shape=xxk DistSim=510]
[Value=Los Text=Los Position=4 Answer=LOCATION Shape=Xxx DistSim=449]
[Value=Angeles Text=Angeles Position=5 Answer=LOCATION Shape=Xxxxx DistSim=199]