Getting output in the desired format using TokenRegex

Question

I am using TokensRegex for rule-based entity extraction. It works well, but I am having trouble getting my output in the desired format. For the following sentence, the snippet of code after it gives me the output shown under OUTPUT:

Earlier this month Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market at a plant in Mexico.

for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    if (matched == null) {
        continue; // nothing to print for this sentence
    }

    // Drop nested matches, then matches with null values
    matched = MatchedExpression.removeNested(matched);
    matched = MatchedExpression.removeNullValues(matched);
    System.out.print("FOR SENTENCE:" + sentence);

    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.print("MATCHED ENTITY: " + phrase.getText() + "\t" + "VALUE: " + phrase.getValue());
    }
}

OUTPUT

MATCHED ENTITY: Donald Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market  

VALUE: LIST([PERSON])

I know if I iterate over tokens using :

for (CoreLabel token : cm.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String lemma = token.get(LemmaAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", NE=" + ne);
}

I can get output that gives the annotation for each token. However, I am using my own rules to detect named entities, and I have sometimes seen cases where one word of a multi-token entity is tagged as a person when the whole multi-token expression should have been an organization (mostly with organization and location names).

So the output I am expecting is:

MATCHED ENTITY: Donald Trump VALUE: PERSON
MATCHED ENTITY: Toyota VALUE: ORGANIZATION

How do I change the above code to get the desired output? Do I need to use custom annotations?

Answer 1:

I produced a jar of the latest build a week or so ago. Use that jar, which is available from GitHub.

This sample code will run the rules and apply the appropriate NER tags.

package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;


public class TokensRegexExampleTwo {

  public static void main(String[] args) {

    // set up properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
    props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
    props.setProperty("tokensregex.caseInsensitive", "true");

    // set up pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // set up text to annotate
    Annotation annotation = new Annotation("...text to annotate...");

    // annotate text
    pipeline.annotate(annotation);

    // print out found entities
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t" + token.ner());
      }
    }
  }
}
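
The loop above prints one token per line, so a multi-token name such as "Donald Trump" shows up as two rows. As a rough sketch of one way to collapse those rows into the MATCHED ENTITY / VALUE lines the question asks for, adjacent tokens that share the same tag can be merged. This is not part of CoreNLP: EntityGrouper and group are hypothetical names, plain string lists stand in for token.word() and token.ner(), untagged tokens are assumed to carry "O", and the tags in main are hand-written for illustration rather than real pipeline output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EntityGrouper {

  // Merge adjacent tokens that share the same non-"O" tag into one entity line.
  // words and tags are parallel lists (hypothetical stand-ins for CoreLabel data).
  public static List<String> group(List<String> words, List<String> tags) {
    List<String> entities = new ArrayList<>();
    StringBuilder text = new StringBuilder();
    String current = "O";
    for (int i = 0; i <= words.size(); i++) {
      // A sentinel "O" after the last token flushes any entity still open.
      String tag = i < words.size() ? tags.get(i) : "O";
      if (tag == null) {
        tag = "O"; // treat a missing tag as "no entity"
      }
      if (!tag.equals(current)) {
        if (!current.equals("O")) {
          entities.add("MATCHED ENTITY: " + text + " VALUE: " + current);
        }
        text.setLength(0);
        current = tag;
      }
      if (!tag.equals("O") && i < words.size()) {
        if (text.length() > 0) {
          text.append(' ');
        }
        text.append(words.get(i));
      }
    }
    return entities;
  }

  public static void main(String[] args) {
    // Hand-written tags for illustration only
    List<String> words = Arrays.asList("Donald", "Trump", "targeted", "Toyota");
    List<String> tags = Arrays.asList("PERSON", "PERSON", "O", "ORGANIZATION");
    group(words, tags).forEach(System.out::println);
  }
}
```

One limitation of this simple approach: two different entities of the same type that sit directly next to each other are merged into a single line.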

Answer 2:

I managed to get the output in the desired format.

Annotation document = new Annotation("<Sentence to annotate>");

// Use the pipeline to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

// Note: I don't put environment-related settings in the rules file; they are set here
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
      .createExtractorFromFiles(env, "test_degree.rules");

// Iterate over the sentences of the annotated document
for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
    }
}

Output:

MATCHED ENTITY: Technical Skill VALUE: SKILL

You might want to have a look at my rule file in this question.

Hope this helps!

Answer 3:

Answering my own question for those struggling with a similar issue. The key to getting your output in the correct format lies in how you define your rules in the rules file. Here's what I changed in the rules to change the output:

Old Rule:

{    ruleType: "tokens",
     pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
     result: Annotate($1, ner, "LOCATION")
}

New Rule:

{    ruleType: "tokens",
     pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
     action: Annotate($1, ner, "LOCATION"),
     result: "LOCATION"
}

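In the new rule, the Annotate call moves to the action field and result is the plain string "LOCATION", so the value printed for each match is that string. How you define the result field determines the output format of your data.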

Hope this helps!

Source: https://stackoverflow.com/questions/43521697/getting-output-in-the-desired-format-using-tokenregex

Tags

stanford-nlp