Stanford NLP Tokens Regex — doesn't recognize NER

Submitted by 别来无恙 on 2021-01-29 05:12:08

Question


I'm just barely getting started with Tokens Regex. I haven't really found an intro or tutorial that gives me what I need. (If I've missed something, links are appreciated!)

The super short, bare-bones idea is that I want to do something like using

pattern: ( ( [ { ner:PERSON } ]) /was/ /born/ /on/ ([ { ner:DATE } ]) )

(from https://nlp.stanford.edu/software/tokensregex.html)

to match "John Smith was born on March 1, 1999", and then be able to extract "John Smith" as the person and "March 1, 1999" as the date.

I've cobbled together the following from a couple of web searches. I can get the simple Java regex /John/ to match, but nothing I've tried (all copied from web searches for examples, and tweaked a bit) matches when I use an NER.

EDIT for clarity: (Success/failure at the moment is true/false from matcher2.matches() in the code below.)

I don't know if I need to explicitly mention some model or an annotation or something, or if I'm missing something else, or if I'm just approaching it entirely the wrong way.

Any insights are much appreciated! Thanks!

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

public class StanfordSandboxTest {
    private static final Log log = LogFactory.getLog(StanfordSandboxTest.class);

    @Test
    public void testFirstAttempt() {

        Properties props2;
        StanfordCoreNLP pipeline2;
        TokenSequencePattern pattern2;
        Annotation document2;
        List<CoreMap> sentences2;
        TokenSequenceMatcher matcher2;
        String text2;

        props2 = new Properties();
        props2.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, parse, dcoref");
        pipeline2 = new StanfordCoreNLP(props2);
        text2 = "March 1, 1999";
        pattern2 = TokenSequencePattern.compile("pattern: (([{ner:DATE}])");
        document2 = new Annotation(text2);
        pipeline2.annotate(document2);
        sentences2 = document2.get(CoreAnnotations.SentencesAnnotation.class);
        matcher2 = pattern2.getMatcher(sentences2);
        log.info("testFirstAttempt: Matches2: " + matcher2.matches());

        props2 = new Properties();
        props2.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, parse, dcoref");
        pipeline2 = new StanfordCoreNLP(props2);
        text2 = "John";
        pattern2 = TokenSequencePattern.compile("/John/");
        document2 = new Annotation(text2);
        pipeline2.annotate(document2);
        sentences2 = document2.get(CoreAnnotations.SentencesAnnotation.class);
        matcher2 = pattern2.getMatcher(sentences2);
        log.info("testFirstAttempt: Matches2: " + matcher2.matches());
    }
}

Answer 1:


Sample code:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;


public class TokensRegexExampleTwo {

  public static void main(String[] args) {

    // set up properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
    props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
    props.setProperty("tokensregex.caseInsensitive", "true");

    // set up pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // set up text to annotate
    Annotation annotation = new Annotation("Joe Smith works for Apple Inc.");

    // annotate text
    pipeline.annotate(annotation);

    // print out found entities
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t" + token.ner());
      }
    }
  }
}

Sample rules file:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

ENV.defaults["stage"] = 1

{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

ENV.defaults["stage"] = 2

{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERS") ) }

This will apply NER tags to "Joe Smith" and "Apple Inc.". You can adapt this to your specific case. Please let me know if you want to do something more advanced than just apply NER tags. Note: make sure you put those rules in a file called: "multi-step-per-org.rules".
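For the "was born on" case from the question, the same two-stage idea might look roughly like the rules below. This is only a sketch built from the constructs in the sample above, not something tested against the question's sentence: the labels BIRTH_DATE and BIRTH_PERSON are made up for illustration, and it assumes the ner annotator tags every token of the name as PERSON and every token of the date (including the comma) as DATE.

```
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

# Stage 1: relabel the date that follows "<PERSON> was born on".
# BIRTH_DATE is a hypothetical label, not a built-in NER class.
ENV.defaults["stage"] = 1

{ pattern: ([{ner: PERSON}]+ /was/ /born/ /on/ ([{ner: DATE}]+)), action: (Annotate($1, ner, "BIRTH_DATE") ) }

# Stage 2: relabel the person, now that the date carries the stage 1 label.
# BIRTH_PERSON is likewise a hypothetical label.
ENV.defaults["stage"] = 2

{ pattern: (([{ner: PERSON}]+) /was/ /born/ /on/ [{ner: "BIRTH_DATE"}]+), action: (Annotate($1, ner, "BIRTH_PERSON") ) }
```

The date is relabeled first so that the stage 2 pattern can still match the person tokens by their original PERSON tag. To try it, save the rules under a file name of your choice and point the tokensregex.rules property at that file instead of "multi-step-per-org.rules"; the token-printing loop in the sample code should then show which tokens were relabeled.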



Source: https://stackoverflow.com/questions/50143681/stanford-nlp-tokens-regex-doesnt-recognize-ner
