Question
I have the following text
text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
When I use Python's re module, I get the following:
import re
regex = r'\d{1}[a|p]m'
re.findall(regex, text)
# Returned:
['5am', '6am', '9pm', '6am', '6am', '9pm']
However, when I used the same regex in spaCy, I got nothing back:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': {'REGEX': r'\d{1}[a|p]m'}}]
matcher.add('TIME', None, pattern)  # spaCy v2 signature; in spaCy v3 use matcher.add('TIME', [pattern])
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)
Does that mean we can't use normal regex with spaCy? If so, do you know where I can learn the special regex syntax of spaCy? Thank you.
Answer 1:
Keep in mind that spaCy's tokenizer separates the number from the letters here, so a time such as 1pm becomes two tokens. See this test:
doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']
As the spaCy docs put it:
If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results.
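To make the mismatch concrete, here is a minimal sketch that uses only Python's re module. It assumes the token split shown in the test above and imitates token-level matching by searching each token text on its own. The tokens list is hard-coded for illustration and is not produced by spaCy, and the names raw, tokens, full_text_hits and per_token_hits are made up for this example.

```python
import re

time_re = re.compile(r'\d[ap]m')

raw = 'to 5am 30%'
# Assumed token texts for `raw`, hard-coded for illustration (not produced by spaCy here)
tokens = ['to', '5', 'am', '30', '%']

# Searching the whole string finds the time
full_text_hits = time_re.findall(raw)

# Searching each token separately finds nothing:
# no single token contains a digit followed by am/pm
per_token_hits = [t for t in tokens if time_re.search(t)]

print(full_text_hits)   # ['5am']
print(per_token_hits)   # []
```

The regex itself is fine. It simply never sees the digit and the am/pm suffix inside the same token.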
So the pattern has to describe two consecutive tokens, a number-like token followed by am or pm, using rule-based matching:
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
Then add it to the matcher:
matcher.add('TIME', None, pattern)  # spaCy v2 signature; in spaCy v3 use matcher.add('TIME', [pattern])
And get the matches:
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)
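A side note on the character class: the pattern here uses [ap] where the question used [a|p]. Inside square brackets, | is a literal character and not alternation, so [a|p]m also accepts |m. The following small check with Python's re module shows the difference. The sample strings, including '7|m', are made up for illustration.

```python
import re

loose = re.compile(r'\d[a|p]m')   # class contains 'a', '|' and 'p'
strict = re.compile(r'\d[ap]m')   # class contains only 'a' and 'p'

# '7|m' is a made-up string chosen to expose the difference
samples = ['5am', '9pm', '7|m']

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)    # ['5am', '9pm', '7|m']
print(strict_hits)   # ['5am', '9pm']
```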
Full demo:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
matcher.add('TIME', None, pattern)  # spaCy v2 signature; in spaCy v3 use matcher.add('TIME', [pattern])
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]
Source: https://stackoverflow.com/questions/57727543/spacys-regex-is-different-to-pythons-regex