TokensRegex rules to get correct output for Named Entities

Posted by 泄露秘密 on 2019-12-01 11:46:51
  1. I have built a jar from the latest Stanford CoreNLP code on the main GitHub page (as of April 14).

  2. With the latest code, this command should run the TokensRegexAnnotator (alternatively, when using the Java API, the tokensregex settings can be passed in a Properties object):

    java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
    
  3. Here is a rule file I wrote that shows matching based on a sentence pattern:

    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
    
    $ORGANIZATION_TITLES = "/inc\.|corp\./"
    
    $COMPANY_INDICATOR_WORDS = "/company|corporation/"
    
    { pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
    
    { pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
    

    Note that $0 refers to the entire matched pattern, and $1 refers to the first capture group. So in this example, I put an extra pair of parentheses around the part of the pattern I wanted to annotate.

    I ran this on the example: Paragonix Inc. is a company that Joe Smith works for.

    The next example uses an extraction from a first stage in the rules of a second stage:

    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
    
    $ORGANIZATION_TITLES = "/inc\.|corp\./"
    
    $COMPANY_INDICATOR_WORDS = "/company|corporation/"
    
    ENV.defaults["stage"] = 1
    
    { pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
    
    ENV.defaults["stage"] = 2
    
    { pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
    

This example should work properly on the sentence "Joe Smith works for Paragonix Inc."
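
Step 2 mentions that the tokensregex settings can also be passed in a Properties object when using the Java API. Here is a minimal sketch of that, assuming the property names simply mirror the command-line flags shown there and that a bare flag such as -tokensregex.caseInsensitive corresponds to the value "true". The class name TokensRegexProps and the helper buildProps are hypothetical names for illustration. Building the pipeline itself needs the CoreNLP jar on the classpath, so that part is shown only as a comment.

```java
import java.util.Properties;

public class TokensRegexProps {

    // Hypothetical helper: collects the same settings as the command-line flags in step 2.
    static Properties buildProps(String rulesFile) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", rulesFile);
        // Assumption: the bare command-line flag is equivalent to setting the value "true".
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps("example.rules");
        // Print the settings so they can be compared against the command line.
        props.forEach((key, value) -> System.out.println(key + " = " + value));

        // With the CoreNLP jar on the classpath, the pipeline would then be created and run as:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   Annotation doc = new Annotation("Joe Smith works for Paragonix Inc.");
        //   pipeline.annotate(doc);
    }
}
```

The rules file path is passed as a parameter so the same helper can point at either of the rule files above.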
