Annotate author names using REGEXNER from the stanfordnlp library

Question

My goal is to annotate author names in scientific articles with the entity PERSON. I am particularly interested in names that match the format (authorname et al. date). For example, in the sentence (Minot et al. 2000), I would like Minot to be annotated as PERSON. I am using an adapted version of the code from the official Stanford NLP page:

import stanfordnlp
from stanfordnlp.server import CoreNLPClient

# example text
text = "In practice, its scope is broad and includes the analysis of a diverse set of samples such as gut microbiome (Qin et al., 2010), (Minot et al., 2011), environmental (Mizuno et al., 2013) or clinical (Willner et al., 2009), (Negredo et al., 2011), (McMullan et al., 2012) samples."
print('---')
print('input text')
print('')
print(text)

# properties dictionary: run regexner after ner, with rules read from rgxrules.txt
prop = {
    'annotators': 'tokenize,ssplit,pos,lemma,ner,regexner',
    'regexner.mapping': 'rgxrules.txt',
}

# set up the client
print('---')
print('starting up Java Stanford CoreNLP Server...')
with CoreNLPClient(properties=prop, timeout=100000, memory='16G', be_quiet=False) as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]

After running the code I get a false negative and a false positive: Negredo is not annotated as PERSON but as O, and Minot is annotated as CITY because it is the name of an American city, although in this sentence it is an author name and should be PERSON.
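To see which label each token received, the annotated sentence can be walked token by token. The following is a minimal sketch, assuming the returned sentence exposes a `token` list whose items carry `word` and `ner` fields; the helper name `ner_pairs` and the `demo` stand-in object are hypothetical and only there so the snippet runs without a server. The labels in `demo` are illustrative, not real CoreNLP output.

```python
from types import SimpleNamespace


def ner_pairs(sentence):
    """Return (word, NER label) pairs for one annotated sentence."""
    return [(token.word, token.ner) for token in sentence.token]


# Stand-in mimicking only the fields used above (hypothetical data).
# With a running server, pass ann.sentence[0] instead of demo.
demo = SimpleNamespace(token=[
    SimpleNamespace(word="Minot", ner="CITY"),
    SimpleNamespace(word="et", ner="O"),
    SimpleNamespace(word="al.", ner="O"),
])

for word, label in ner_pairs(demo):
    print(f"{word}\t{label}")
```

Inside the `with CoreNLPClient(...)` block, the same loop over `ner_pairs(sentence)` would list the labels for the first sentence of the example text.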

My attempt to solve this was to add the following line to the rgxrules.txt file that I pass to CoreNLPClient:

[[A-Z][a-z]] /et/ /al\./\tPERSON

This does not solve the problem, as you can check by running the code. I also don't know how to express that only the word that matches '[[A-Z][a-z]]' and precedes et al. should be annotated as PERSON, not the whole phrase 'Minot et al.', for example.

Any idea how I can solve this problem?

Thank you in advance.

Answer 1:

For matching with Java regular expressions, I'm pretty sure you want something like

[A-Za-z]+ et al[.]

However, I don't know of any way to avoid labeling et al. as well, such as a token lookahead. What happens if you then add another line to the regex file that sets et al. back to O? You would probably need to list PERSON as a type that this O rule is allowed to overwrite.
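The two-rule idea can be tried by generating the mapping file from Python, which also guarantees that the columns are separated by real tab characters rather than a literal backslash-t. This is a sketch only, not tested against a CoreNLP server. It assumes the RegexNER mapping layout of pattern, label, and an optional comma-separated list of labels the rule may overwrite. It also assumes the tokenizer emits "al." as a single token. Listing CITY as overwritable for the first rule is an assumption based on the Minot case in the question; `RULES` and `write_rules` are hypothetical names.

```python
from pathlib import Path

# Each rule: space-separated token patterns, target label,
# comma-separated labels this rule may overwrite (assumed layout).
RULES = [
    ("[A-Za-z]+ et al[.]", "PERSON", "CITY"),  # label "Name et al." as PERSON
    ("et al[.]", "O", "PERSON"),               # then try to reset "et al." to O
]


def write_rules(path, rules=RULES):
    """Write the rules as tab-separated lines and return those lines."""
    lines = ["\t".join(rule) for rule in rules]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


rule_lines = write_rules("rgxrules.txt")
```

Whether the second rule actually overrides the first depends on how the annotator resolves overlapping matches, so the per-token labels should be checked after annotating.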

Source: https://stackoverflow.com/questions/61231337/annotate-author-names-using-regexner-from-the-stanfordnlp-library

Tags

python

regex

stanford-nlp

ner