Stanford Core NLP: Entity type non deterministic

Posted by 人走茶凉 on 2019-12-05 13:44:06

I've looked over the code, and here is one possible way to resolve this:

Load each of the three serialized CRFs with useKnownLCWords set to false and serialize them again. Then supply the new serialized CRFs to your StanfordCoreNLP pipeline.

Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:

java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz

Use whatever file names you like. This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a classifiers directory containing the serialized CRFs. Adjust as appropriate for your setup.

The command should load the serialized CRF, override its useKnownLCWords setting with false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz.

Then in your original code:

nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
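
As a rough sketch of what that line might look like with real paths filled in: the snippet below builds the comma-separated value from a list of model paths. Only the "ner.model" key comes from the answer above. The class name NerModelConfig and the three file names are placeholders that follow the "new." renaming pattern from the command, so substitute whatever names you actually serialized to.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper; only the "ner.model" key is taken from the answer above.
class NerModelConfig {

    // Joins the model paths with commas and stores them under "ner.model".
    static Properties build(List<String> modelPaths) {
        Properties props = new Properties();
        props.setProperty("ner.model", String.join(",", modelPaths));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder paths: use the files you re-serialized with -useKnownLCWords false.
        List<String> paths = Arrays.asList(
            "classifiers/new.english.all.3class.distsim.crf.ser.gz",
            "classifiers/new.english.conll.4class.distsim.crf.ser.gz",
            "classifiers/new.english.muc.7class.distsim.crf.ser.gz");

        Properties props = build(paths);
        System.out.println(props.getProperty("ner.model"));
        // Add the rest of your pipeline settings to props, then pass it to
        // the StanfordCoreNLP constructor as in your original code.
    }
}
```

Building the value with String.join avoids stray spaces or a trailing comma in the list, which are easy to introduce when the string is typed by hand.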

Please let me know whether this works. If it doesn't, I can look into it more deeply.

Here is the answer from the NER FAQ:

http://nlp.stanford.edu/software/crf-faq.shtml

Is the NER deterministic? Why do the results change for the same data?

Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason for this is the NER remembers whether it has seen a word in lowercase form before.

The exact way this is used as a feature is in the word shape feature, which treats words such as "Brown" differently if it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase", and if it has not, the word shape will be "Initial upper, have not seen all lowercase".

This feature can be turned off in recent versions with the flag -useKnownLCWords false
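
To make the FAQ's explanation concrete, here is a toy model of a word shape feature that remembers lowercase words. This is not CoreNLP's implementation: the class KnownLowerCaseDemo and its shape method are made up for illustration, and the boolean only mimics what the useKnownLCWords flag is described as doing. It shows why the same token can get a different feature the second time around.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy illustration of the behaviour described in the FAQ; NOT CoreNLP code.
class KnownLowerCaseDemo {
    private final Set<String> knownLowerCase = new HashSet<>();
    private final boolean useKnownLCWords;

    KnownLowerCaseDemo(boolean useKnownLCWords) {
        this.useKnownLCWords = useKnownLCWords;
    }

    // Returns a shape label for the word; with the flag on, it also
    // remembers every all-lowercase word it is shown (hidden state).
    String shape(String word) {
        if (word.isEmpty()) {
            return "Other";
        }
        String lower = word.toLowerCase(Locale.ROOT);
        if (word.equals(lower)) {
            if (useKnownLCWords) {
                knownLowerCase.add(word);
            }
            return "All lowercase";
        }
        if (Character.isUpperCase(word.charAt(0))) {
            if (!useKnownLCWords) {
                return "Initial upper";
            }
            return knownLowerCase.contains(lower)
                ? "Initial upper, have seen all lowercase"
                : "Initial upper, have not seen all lowercase";
        }
        return "Other";
    }

    public static void main(String[] args) {
        KnownLowerCaseDemo stateful = new KnownLowerCaseDemo(true);
        System.out.println(stateful.shape("Brown")); // before "brown" was seen
        stateful.shape("brown");                     // side effect: remembered
        System.out.println(stateful.shape("Brown")); // same input, new label

        KnownLowerCaseDemo stateless = new KnownLowerCaseDemo(false);
        System.out.println(stateless.shape("Brown"));
        stateless.shape("brown");
        System.out.println(stateless.shape("Brown")); // unchanged
    }
}
```

With the flag on, the label for "Brown" depends on what the object has processed earlier, so results depend on input order. With it off, the label is a pure function of the word.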

prabhu shankar

After doing some research, I found that the issue is in the ClassifierCombiner.classify() method. One of the baseClassifiers loaded by default, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, occasionally returns a different entity type. I am trying to load only the first model to work around this issue.

The problem is in the following area of the code:

CRFClassifier.classifyMaxEnt()

int[] bestSequence = tagInference.bestSequence(model); // line 1249

ExactBestSequenceFinder.bestSequence() returns a different sequence for the above model when called multiple times with the same input.

I'm not sure whether this needs a code fix or a configuration change to the model. Any additional insight is appreciated.
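
One way to pin down this kind of behaviour is to call the tagger repeatedly on the same input and report the first run that differs. The sketch below is generic and makes no CoreNLP calls: DeterminismCheck and firstDivergingRun are hypothetical names, and the Function argument is a stand-in that you would replace with a wrapper around your own classifier call.

```java
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical diagnostic helper; the tagger is a stand-in for a real classifier call.
class DeterminismCheck {

    // Runs the tagger `runs` times on the same input and returns the index of
    // the first run whose output differs from run 0, or -1 if all runs agree.
    static int firstDivergingRun(Function<String, int[]> tagger, String input, int runs) {
        int[] baseline = tagger.apply(input);
        for (int i = 1; i < runs; i++) {
            if (!Arrays.equals(baseline, tagger.apply(input))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A pure stand-in tagger: output depends only on the input.
        Function<String, int[]> pure = s -> new int[] { s.length() };

        // A stand-in tagger with hidden state: output changes after the first call.
        int[] calls = { 0 };
        Function<String, int[]> stateful = s -> new int[] { calls[0]++ == 0 ? 0 : 1 };

        System.out.println(firstDivergingRun(pure, "Brown", 5));
        System.out.println(firstDivergingRun(stateful, "Brown", 5));
    }
}
```

Running the same check once against a freshly loaded model and once against a model that has already processed other text can help separate true nondeterminism from state carried over between calls.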
