StanfordNLP - ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:696)

Question

I want to identify the following phrases as SKILL using Stanford CoreNLP's TokensRegexNERAnnotator:

AREAS OF EXPERTISE Areas of Knowledge Computer Skills Technical Experience Technical Skills

There are many more sequences of text like the ones above.

Code:

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.pipeline.TokensRegexNERAnnotator;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // Add the regex-based NER annotator, configured with my rule file.
    pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
    String[] tests = {
        "Bachelor of Arts is a good degree.",
        "Technical Skill is a must have for Software Developer."
    };

    // Annotate each test sentence in turn.
    for (String txt : tests) {
        System.out.println("String is : " + txt);

        // Create an Annotation holding just the given text.
        Annotation document = new Annotation(txt);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        // Go over the annotated sentences and print each token's annotations.
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println("annotated coreMap sentences : " + token);
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println("Current Word : " + word + " POS :" + pos);
                System.out.println("Lemma : " + lemma);
                System.out.println("Named Entity : " + ne);
            }
        }
    }

My TokensRegex rule file is:

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    { ruleType: "tokens", pattern: ($SKILL_FIRST_KEYWORD + $SKILL_KEYWORD), result: "SKILL" }

I am getting an ArrayIndexOutOfBoundsException. I suspect something is wrong with my rule file. Can somebody point out where I am making a mistake?

Desired output:

AREAS OF EXPERTISE - SKILL

Areas of Knowledge - SKILL

Computer Skills - SKILL

and so on.

Thanks in advance.

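Answer 1:

You should be using the TokensRegexAnnotator, not the TokensRegexNERAnnotator. The TokensRegexNERAnnotator reads a tab-delimited mapping file (a pattern column followed by an NER label column), not a TokensRegex rules file, which is most likely why readEntries fails on your file.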

You should review these threads for more info:

TokensRegex rules to get correct output for Named Entities

Getting output in the desired format using TokenRegex
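
As a rough illustration of why the rule file trips up TokensRegexNERAnnotator: each entry in that annotator's mapping file is a single line of tab-separated columns, starting with a pattern and an NER label. The sketch below is a hypothetical pre-flight check written for this thread. MappingFileCheck is not a CoreNLP class, and it only approximates the real parser by assuming that every non-blank line needs at least those two columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check for a mapping file; not part of CoreNLP. */
final class MappingFileCheck {

    /** True if the line has at least two non-empty tab-separated columns (assumed layout). */
    static boolean isValidEntry(String line) {
        String[] columns = line.split("\t");
        return columns.length >= 2
                && !columns[0].trim().isEmpty()
                && !columns[1].trim().isEmpty();
    }

    /** Returns the 1-based numbers of the non-blank lines that fail the check. */
    static List<Integer> findBadLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (!line.trim().isEmpty() && !isValidEntry(line)) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up entries in the tab-separated layout: pattern, then label.
        List<String> mapping = Arrays.asList(
                "Bachelor of Arts\tDEGREE",
                "Technical Skills?\tSKILL");
        // The rule file from the question, line by line.
        List<String> rules = Arrays.asList(
                "tokens = { type: \"CLASS\", value: \"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation\" }",
                "",
                "{ ruleType: \"tokens\", pattern: ($SKILL_FIRST_KEYWORD + $SKILL_KEYWORD), result: \"SKILL\" }");
        System.out.println("mapping file bad lines: " + findBadLines(mapping));
        System.out.println("rule file bad lines: " + findBadLines(rules));
    }
}
```

Run as written, the mapping entries pass and both non-blank lines of the rule file are flagged, because neither contains a tab. A parser that indexes the second column without checking the column count would fail on exactly such lines.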

Answer 2:

The accepted answer above by @StanfordNLPHelp helped me solve this problem. All credit goes to them.

I am just summarizing what the final code looks like to get the output in the desired format, in the hope that it helps somebody.

First, I made changes in the rule file.

Then, in the code:

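    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
    import edu.stanford.nlp.ling.tokensregex.Env;
    import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
    import edu.stanford.nlp.ling.tokensregex.NodePattern;
    import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Build the extractor once, with case-insensitive string matching in the rules.
    Env env = TokenSequencePattern.getNewEnv();
    env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
    env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
    CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");

    // tests is the same String[] of sample sentences as in the question.
    for (String txt : tests) {
        System.out.println("String is : " + txt);

        // Create an Annotation holding just the given text.
        Annotation document = new Annotation(txt);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);
            for (MatchedExpression phrase : matched) {
                // Print out matched text and value
                System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
            }
        }
    }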
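
The Pattern.CASE_INSENSITIVE flag set on the environment above is the ordinary java.util.regex flag. To show what it buys for the phrases in the question, here is a minimal sketch that uses plain java.util.regex over raw text instead of TokensRegex. The class name SkillPhraseDemo and the two keyword lists are assumptions, chosen only so that the question's sample phrases match. They are not the contents of the actual rule file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain java.util.regex illustration; it does not use TokensRegex. */
final class SkillPhraseDemo {

    // Hypothetical keyword lists: a leading keyword followed by a skill keyword.
    private static final Pattern SKILL_PHRASE = Pattern.compile(
            "\\b(?:areas? of|computer|technical)\\s+(?:expertise|knowledge|skills?|experience)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns every substring of text that matches the phrase pattern, in order. */
    static List<String> findSkillPhrases(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SKILL_PHRASE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "AREAS OF EXPERTISE Areas of Knowledge Computer Skills "
                + "Technical Experience Technical Skills";
        // Prints one "<phrase> - SKILL" line per match, mirroring the desired output.
        for (String phrase : findSkillPhrases(text)) {
            System.out.println(phrase + " - SKILL");
        }
    }
}
```

Without the flag, the same pattern would miss "AREAS OF EXPERTISE" and "Computer Skills", since the keywords are written in lower case.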

Source: https://stackoverflow.com/questions/43691901/stanfordnlp-arrayindexoutofboundsexception-at-tokensregexnerannotator-readentr

Tags: java, nlp, stanford-nlp