TokensRegex rules to get correct output for Named Entities

Question

I was finally able to get my TokensRegex code to give some kind of output for named entities. But the output is not exactly what I want. I believe the rules need some tweaking.

Here's the code:

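import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.NodePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
import edu.stanford.nlp.util.CoreMap;

// The class declaration was not shown in the original post; this name is a placeholder.
public class TokensRegexNerDemo {
    public static void main(String[] args)
    {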
        String  rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
        String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt";

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ner.useSUTime", "0");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
        String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";

        Annotation document = new Annotation(inputText);
        pipeline.annotate(document);
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE); 
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);

        /* Next we can go over the annotated sentences and extract the annotated words,
         Using the CoreLabel Object */
        for (CoreMap sentence : sentences)
        {

            List<MatchedExpression> matched = extractor.extractExpressions(sentence);

            for(MatchedExpression phrase : matched){

                // Print out matched text and value
                System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class))
                {
                    if (token.tag().equals("NNP")){
                        String leftContext = token.before();
                        String rightContext = token.after();
                        System.out.println(leftContext);
                        System.out.println(rightContext);


                        String word = token.get(TextAnnotation.class);
                        String lemma = token.get(LemmaAnnotation.class);
                        String pos = token.get(PartOfSpeechAnnotation.class);
                        String ne = token.get(NamedEntityTagAnnotation.class);
                        System.out.println("matched token: " + "word="+word + ", lemma="+lemma + ", pos=" + pos + "ne=" + ne);
                    }

                }
            }
        }
    }
}

And here's the rules file:

$TITLES_CORPORATE  = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
$ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)

# For detecting organization names like 'Paragonix Inc.' 

{    ruleType: "tokens",
     pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
     action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") ) 
}

# For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.' 
#(in the sentence given above the words planning and expand are part of the $OrgContextWords macros )
{
  ruleType: "tokens",
  pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
  result:  ( Annotate($1, ner, "ORGANIZATION") ) 
}

# For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....  

ENV.defaults["stage"] = 1
{
  pattern: ( $TITLES_CORPORATE ), 
  action: ( Annotate($1, ner, "PERSON_TITLE")) 
}

ENV.defaults["stage"] = 2 
{
  ruleType: "tokens",
  pattern:  ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
  result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
}

The output I get is:

matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=PERSON
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

The output I am expecting is:

matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=ORGANIZATION
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

Also, Bill Edelman is not identified as a person here: the phrase containing his name is never matched, even though I have a rule in place for it. Do I need to stage my rules so that the entire phrase is matched against each rule and no entities are missed?

Answer 1:

I have built a jar from the latest Stanford CoreNLP code on the main GitHub page (as of April 14).

This command (with the latest code) should work for using the TokensRegexAnnotator (alternatively the tokensregex settings can be passed into a Properties object if using the Java API):

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
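
For the Java API route mentioned above, a minimal sketch of the equivalent Properties object might look like the following. The property names are taken directly from the command-line flags; the class name `TokensRegexProps`, the helper `build`, and the assumption that a bare flag such as `-tokensregex.caseInsensitive` corresponds to the value `"true"` are mine, so check them against your CoreNLP version. The pipeline construction is left as a comment so the snippet compiles without CoreNLP on the classpath.

```java
import java.util.Properties;

// Hypothetical helper: mirrors the command-line flags above as Properties keys.
class TokensRegexProps {
    static Properties build(String rulesPath) {
        Properties props = new Properties();
        // Same annotator list as the command line, with tokensregex running after ner.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", rulesPath);
        // Assumption: a bare command-line flag is equivalent to the value "true".
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("example.rules");
        props.forEach((k, v) -> System.out.println(k + " = " + v));
        // With CoreNLP on the classpath, the pipeline would then be created as:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   pipeline.annotate(new Annotation(text));
    }
}
```

Listing `tokensregex` among the annotators this way avoids having to call `addAnnotator` on the pipeline after it has been constructed.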

Here is a rule file I wrote that shows matching based on a sentence pattern:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

{ pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

{ pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }

Note that $0 refers to the entire matched pattern, and $1 refers to the first capture group. So in this example, I put an extra pair of parentheses around the part of the pattern that I wanted to annotate.
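
The $0 versus $1 distinction behaves much like group 0 versus group 1 in ordinary regular expressions. The sketch below is only an analogy using `java.util.regex` on plain strings, not TokensRegex itself; the class name `CaptureGroupAnalogy` and the character-level pattern are made up for illustration and only roughly imitate the token-level rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: plain-string regex groups standing in for TokensRegex $0 and $1.
class CaptureGroupAnalogy {
    // Returns {whole match, first capture group}, or null when nothing matches.
    static String[] groups(String text) {
        // The outer parentheses around "name + suffix" form group 1,
        // playing the role of $1 in the first rule.
        Pattern p = Pattern.compile(
            "((?:[A-Z]\\w+ )+(?:Inc\\.|Corp\\.)) is a (?:company|corporation)");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("Paragonix Inc. is a company that Joe Smith works for.");
        System.out.println("$0-like (whole match):  " + g[0]);
        System.out.println("$1-like (first group):  " + g[1]);
    }
}
```

Annotating $0 in the rule would tag the context words ("is a company") as well, which is why the rule annotates $1 instead.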

I ran this on the example: Paragonix Inc. is a company that Joe Smith works for.

This example shows using an extraction from a first round in a second round:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

ENV.defaults["stage"] = 1

{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

ENV.defaults["stage"] = 2

{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }

This example should work properly on the sentence Joe Smith works for Paragonix Inc.

Source: https://stackoverflow.com/questions/43447585/tokensregex-rules-to-get-correct-output-for-named-entities

Tags

stanford-nlp