Stanford CoreNLP: Entity type non-deterministic

Submitted by 做~自己de王妃 on 2019-12-07 11:43:33

Question


I have built a Java parser using Stanford CoreNLP. I am having trouble getting consistent results from the CoreNLP object: I get different entity types for the same input text. It looks like a bug in CoreNLP to me. I am wondering whether any StanfordNLP users have encountered this issue and found a workaround. This is my service class, which I instantiate and reuse.

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    class StanfordNLPService {
        //private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());
        private StanfordCoreNLP nerPipeline;

        /*
         * Initialize the NLP instances for NER and sentiments.
         */
        public void init() {
            Properties nerAnnotators = new Properties();
            nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
            nerPipeline = new StanfordCoreNLP(nerAnnotators);
        }

        /**
         * @param text Text from which entities are to be extracted.
         */
        public void printEntities(String text) {
            //        boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
            try {
                // Properties nerAnnotators = new Properties();
                // nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
                // nerPipeline = new StanfordCoreNLP(nerAnnotators);
                Annotation document = nerPipeline.process(text);
                // A CoreMap is essentially a Map that uses class objects as keys and has values with custom types.
                List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
                for (CoreMap sentence : sentences) {
                    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                        // Get the entity type and offset information needed.
                        String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);  // NER type
                        int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);    // token offset_start
                        int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class);        // token offset_end
                        String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);           // POS type
                        System.out.println("(Type:value:offset)\t" + currEntityType + ":\t" + text.substring(currStart, currEnd) + "\t" + currStart);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
Discrepancy in the results: the entity type changed from MISC to O after the initial run.
Iteration 1:
(Type:value:offset) MISC:   Appropriate 100
(Type:value:offset) MISC:   Time    112
Iteration 2:
(Type:value:offset) O:  Appropriate 100
(Type:value:offset) O:  Time    112

Answer 1:


I've looked over the code some, and here is a possible way to resolve this:

What you could do to solve this is load each of the three serialized CRFs with useKnownLCWords set to false, and serialize them again. Then supply the new serialized CRFs to your StanfordCoreNLP.

Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:

java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz

Use whatever names you want, obviously! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a classifiers directory containing the serialized CRFs. Change as appropriate for your setup.

This command should load the serialized CRF, override it with useKnownLCWords set to false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz.

Then in your original code:

nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
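For example, if you re-serialized all three default English models, the property might look like this (a sketch only; the new.* filenames and the classifiers/ directory are assumptions carried over from the re-serialization command above):

    nerAnnotators.put("ner.model",
            "classifiers/new.english.all.3class.distsim.crf.ser.gz,"
            + "classifiers/new.english.muc.7class.distsim.crf.ser.gz,"
            + "classifiers/new.english.conll.4class.distsim.crf.ser.gz");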

Please let me know if this works or if it's not working, and I can look more deeply into this!




Answer 2:


Here is the answer from the NER FAQ:

http://nlp.stanford.edu/software/crf-faq.shtml

Is the NER deterministic? Why do the results change for the same data?

Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason for this is that the NER remembers whether it has seen a word in lowercase form before.

The exact way this is used as a feature is in the word shape feature, which treats words such as "Brown" differently if it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase", and if it has not, the word shape will be "Initial upper, have not seen all lowercase".

This feature can be turned off in recent versions with the flag -useKnownLCWords false.




Answer 3:


After doing some research, I found that the issue is in the ClassifierCombiner.classify() method. One of the base classifiers loaded by default, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, returns a different type on some occasions. I am trying to load only the first model to resolve this issue.
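As a sketch of that workaround, the pipeline can be restricted to a single CRF by overriding the ner.model property (the path below is the default 3-class English model bundled with CoreNLP; whether this fully removes the non-determinism is untested here):

    Properties nerAnnotators = new Properties();
    nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
    // Assumption: only the 3-class model is loaded, bypassing the combiner's other base classifiers.
    nerAnnotators.put("ner.model", "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz");
    StanfordCoreNLP nerPipeline = new StanfordCoreNLP(nerAnnotators);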

The problem is in the following area of the code, CRFClassifier.classifyMaxEnt():

    int[] bestSequence = tagInference.bestSequence(model); // line 1249

ExactBestSequenceFinder.bestSequence() returns a different sequence for the above model for the same input when called multiple times.

Not sure if this needs code fix or some configuration changes to the model. Any additional insight is appreciated.



Source: https://stackoverflow.com/questions/31679761/stanford-core-nlp-entity-type-non-deterministic
