Question
I want to use regex in SpaCy to find any combination of (Accrued or accrued or Annual or annual) leave by this code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
matcher.add('LEAVE', None,
            [{'TEXT': {'REGEX': '(Accrued|accrued|Annual|annual)'}},
             {'LOWER': 'leave'}])
# Call the matcher on the doc
doc= nlp('Annual leave shall be paid at the time . An employee is to receive their annual leave payment in the normal pay cycle. Where an employee has accrued annual leave in')
matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print('- ', matched_span.sent.text)
# returned:
- Annual leave shall be paid at the time .
- An employee is to receive their annual leave payment in the normal pay cycle.
- Where an employee has accrued annual leave in
However, I don't think my regex is abstract/generalized enough to apply to other situations. I would very much appreciate your advice on how to improve my regex with spaCy.
Answer 1:
Your code is fine; you just have a typo (ananual), and once that is fixed your code will yield all 3 sentences.
However, you do not need to repeat the differently cased words. With a Python re regex, you can add the (?i) inline modifier at the start of the pattern to make the whole pattern case-insensitive.
You may use
"(?i)accrued|annual"
Or, to match whole words, add word boundaries \b:
r"(?i)\b(?:accrued|annual)\b"
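Plugged back into the question's token pattern, that looks like the sketch below. spaCy applies a REGEX pattern to each token's text with re.search semantics, so the per-token behaviour can be sanity-checked with plain re:

```python
import re

# The question's token pattern, with the case-insensitive regex swapped in
pattern = [{'TEXT': {'REGEX': r"(?i)\b(?:accrued|annual)\b"}},
           {'LOWER': 'leave'}]

# Checking the regex on individual token texts with plain re:
print(bool(re.search(r"(?i)\b(?:accrued|annual)\b", "Annual")))    # True
print(bool(re.search(r"(?i)\b(?:accrued|annual)\b", "biannual")))  # False
```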
Note the r prefix before the opening ": it makes the string literal raw, so you do not have to escape \ in it (r"\b" is the same as "\\b").
The (?:...) non-capturing group is there to make sure \b word boundaries get applied to all the alternatives inside the group. \baccrued|annual\b will match accruednesssss or biannual, for example (it will match words that start with accrued or those ending with annual).
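A quick illustration of that difference with Python's re module (the sample string is made up):

```python
import re

text = "biannual accruedness accrued leave annual leave"

# Without the group, \b binds only to the outer alternatives:
# '\baccrued' matches the start of "accruedness", 'annual\b' the end of "biannual"
loose = re.findall(r"(?i)\baccrued|annual\b", text)
print(loose)   # ['annual', 'accrued', 'accrued', 'annual']

# With the non-capturing group, both boundaries apply to every alternative:
strict = re.findall(r"(?i)\b(?:accrued|annual)\b", text)
print(strict)  # ['accrued', 'annual']
```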
Answer 2:
In many NLP libraries, the tokenizer also stores a lowercased form of each token, making it unnecessary to write a regex alternative for each casing. That is the case for spaCy (the LOWER attribute).
However, Spacy matcher works better if you make use of the linguistic features that it is packaged with.
Let's start by creating a matcher based on linguistic features: you want to detect any type of leave (annual and I guess in the future you might consider monthly, weekly, etc) - these are all adjectives. So you could define a pattern that includes the "leave" word preceded by an adjective, like so:
pattern = [{'POS': 'ADJ'},
           {'LEMMA': 'leave'}]
In the above snippet, POS stands for part of speech and receives the value ADJ (for adjective); LEMMA stands for the word's root. Notice, however, that "accrued" is recognized as a verb, not an adjective (in fact, this polysemy problem exists in any NLP library). You could also add another pattern just for "accrued leave", using two LEMMA values.
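That extra pattern could look like the sketch below. The lemma 'accrue' and the pattern name 'ACCRUED_LEAVE' are assumptions (the lemma follows from the note above that the model tags "accrued" as a verb):

```python
# Hypothetical second pattern for "accrued leave", matching on lemmas;
# 'accrue' assumes the model lemmatizes the verb "accrued" to "accrue"
accrued_pattern = [{'LEMMA': 'accrue'},
                   {'LEMMA': 'leave'}]
# matcher.add('ACCRUED_LEAVE', None, accrued_pattern)  # spaCy v2 signature
```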
Just add the matcher and you're good to go:
matcher = Matcher(nlp.vocab)
matcher.add('LEAVE', None, pattern)
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print('- ', matched_span.sent.text)
Source: https://stackoverflow.com/questions/57573368/using-regex-in-spacy-matching-various-different-cased-words