Question
I want to tag the following phrases as SKILL using Stanford CoreNLP's TokensRegexNERAnnotator:
AREAS OF EXPERTISE
Areas of Knowledge
Computer Skills
Technical Experience
Technical Skills
There are many more sequences of text like the ones above.
Code -
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
List<CoreLabel> tokens = new ArrayList<>();
// Traverse each sentence in the array of test sentences.
for (String txt : tests) {
    System.out.println("String is : " + txt);
    // Create an Annotation holding just the given text
    Annotation document = new Annotation(txt);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    /* Next we can go over the annotated sentences and extract the annotated words,
       using the CoreLabel object */
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            System.out.println("annotated coreMap sentences : " + token);
            // Extract the NER tag for the current token
            String ne = token.get(NamedEntityTagAnnotation.class);
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
            System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
            System.out.println("Named Entity : " + ne);
        }
    }
}
My regex rule file is -
$SKILL_FIRST_KEYWORD = "/area of/|/areas of/|/technical/|/computer/|/professional/"
$SKILL_KEYWORD = "/knowledge/|/skill/|/skills/|/expertise/|/experience/"
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
{ ruleType: "tokens", pattern: ($SKILL_FIRST_KEYWORD + $SKILL_KEYWORD), result: "SKILL" }
I am getting an ArrayIndexOutOfBoundsException. I guess there is something wrong with my rule file. Can somebody please point out where I am making a mistake?
Desired Output -
AREAS OF EXPERTISE - SKILL
Areas of Knowledge - SKILL
Computer Skills - SKILL
and so on.
Thanks in advance.
Answer 1:
You should be using the TokensRegexAnnotator, not the TokensRegexNERAnnotator.
You should review these threads for more info:
TokensRegex rules to get correct output for Named Entities
Getting output in the desired format using TokenRegex
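One plausible reading of the exception, offered as an assumption rather than a confirmed diagnosis: TokensRegexNERAnnotator is documented to read a tab-separated mapping file (a pattern, then its NER label), so a TokensRegex rules line with no tab has no second column to read. The sketch below is not CoreNLP code; `MappingLineCheck` and `labelOf` are hypothetical names that only reproduce that failure mode with plain Java.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a reader of "pattern<TAB>label" entries.
// This is NOT CoreNLP's implementation; it only illustrates the failure mode.
final class MappingLineCheck {

    // Returns the label column, assuming each entry looks like "pattern\tLABEL".
    static String labelOf(String line) {
        String[] columns = line.split("\t");
        // With no tab in the line, columns has length 1 and this index is out of bounds.
        return columns[1];
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            // Mapping-style entry: pattern, tab, label.
            "Technical Skills\tSKILL",
            // Rules-style line from a TokensRegex file: no tab at all.
            "$SKILL_KEYWORD = \"/knowledge|skill|skills|expertise|experience/\"");

        for (String line : lines) {
            try {
                System.out.println("Label: " + labelOf(line));
            } catch (ArrayIndexOutOfBoundsException e) {
                System.out.println("No label column in: " + line);
            }
        }
    }
}
```

If this reading is right, the rules file itself may be fine; it is simply being handed to an annotator that expects a different file format.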
Answer 2:
The accepted answer above by @StanfordNLPHelp helped me solve this problem. All credit goes to them.
I am just summarizing what the final code looks like to get output in the desired format, in the hope that it helps somebody.
First, I changed the rule file:
$SKILL_FIRST_KEYWORD = "/area of|areas of|Technical|computer|professional/"
$SKILL_KEYWORD = "/knowledge|skill|skills|expertise|experience/"
Then, in the code:
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Build the extractor once; it does not depend on the text being annotated.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
for (String txt : tests) {
    System.out.println("String is : " + txt);
    // Create an Annotation holding just the given text
    Annotation document = new Annotation(txt);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        List<MatchedExpression> matched = extractor.extractExpressions(sentence);
        for (MatchedExpression phrase : matched) {
            // Print out matched text and value
            System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
        }
    }
}
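To sanity-check which phrases the two keyword lists in the rule file are meant to cover, without loading CoreNLP at all, the same alternations can be tried with plain java.util.regex. This is only a rough sketch under stated assumptions: it matches on raw text rather than tokens, so it is not equivalent to TokensRegex, and `SkillPhraseCheck` and `findSkillPhrases` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-regex approximation of the SKILL rule; not TokensRegex.
final class SkillPhraseCheck {

    // First keyword, whitespace, then skill keyword, matched case-insensitively.
    // The alternations mirror $SKILL_FIRST_KEYWORD and $SKILL_KEYWORD from the rule file.
    private static final Pattern SKILL = Pattern.compile(
        "\\b(?:area of|areas of|technical|computer|professional)\\s+"
            + "(?:knowledge|skills|skill|expertise|experience)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every substring of the text that looks like a SKILL phrase.
    static List<String> findSkillPhrases(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = SKILL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] tests = {
            "AREAS OF EXPERTISE",
            "Areas of Knowledge",
            "Computer Skills",
            "Technical Skill is a must have for Software Developer.",
            "Bachelor of Arts is a good degree."
        };
        for (String txt : tests) {
            System.out.println(txt + " -> " + findSkillPhrases(txt));
        }
    }
}
```

A check like this is handy for iterating on the keyword lists quickly; the TokensRegex extractor remains the place where token boundaries, POS tags, and the SKILL value are actually applied.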
Source: https://stackoverflow.com/questions/43691901/stanfordnlp-arrayindexoutofboundsexception-at-tokensregexnerannotator-readentr