spaCy's regex is different to Python's regex

Posted by 我的梦境 on 2020-11-29 10:25:09

Question


I have the following text

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'

When I used normal regex, I obtained the following

import re
regex = r'\d{1}[a|p]m'  # raw string so that \d reaches the regex engine unchanged
re.findall(regex, text)

# Returned:
['5am', '6am', '9pm', '6am', '6am', '9pm']

However, when I used the same regex in spaCy, I got nothing back

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg')

matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': {'REGEX': '\d{1}[a|p]m'}}]
matcher.add('TIME', None, pattern)

doc = nlp(text)
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)

Does that mean we can't use normal regex with spaCy? If so, do you know where I can learn the special regex syntax of spaCy? Thank you.


Answer 1:


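Keep in mind that spaCy's tokenizer splits the number from the letters here, and a REGEX in a Matcher pattern is tested against a single token's text, not against the whole string. See this test: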

doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']

As the spaCy docs note:

If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results.
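
To see why the original pattern returns nothing, it may help to mimic roughly what the Matcher does with a REGEX entry: the expression is searched in one token's text at a time. The sketch below uses plain `re` and a hand-written token list (an assumption that mirrors the ['1', 'pm'] split shown above, not actual spaCy output).

```python
import re

# The question's pattern, written as a raw string.
time_regex = re.compile(r"\d{1}[a|p]m")

# Hand-written tokens standing in for how spaCy might split "to 5am 30%"
# (assumption based on the ['1', 'pm'] split shown above).
tokens = ["to", "5", "am", "30", "%"]

# Against the whole string, the pattern finds the time.
whole_text_hits = time_regex.findall("to 5am 30%")

# Against each token separately, no single token holds digit + am/pm.
per_token_hits = [tok for tok in tokens if time_regex.search(tok)]

print(whole_text_hits)  # ['5am']
print(per_token_hits)   # []
```

The digit and the "am" never sit in the same token, so a single-token regex has nothing to match.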

Instead of one regex spanning both tokens, use rule-based matching with one pattern entry per token, here a number-like token followed by am or pm:

pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]

Then add it to the matcher:

matcher.add('TIME', None, pattern)

Then run the matcher over the doc and print the matches:

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)
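
A side note on the regex itself: the question's character class `[a|p]` is not an alternation. Inside square brackets `|` is a literal character, so the class also accepts a pipe; `[ap]`, as used in the pattern above, is the tighter form. A quick check with plain `re` (the sample strings are made up for illustration):

```python
import re

loose = re.compile(r"^\d[a|p]m$")   # class contains 'a', '|' and 'p'
strict = re.compile(r"^\d[ap]m$")   # class contains only 'a' and 'p'

samples = ["5am", "9pm", "5|m"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['5am', '9pm', '5|m']
print(strict_hits)  # ['5am', '9pm']
```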

Full demo:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
matcher.add('TIME', None, pattern)  # spaCy v2 signature; in spaCy v3 use matcher.add('TIME', [pattern])

matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]
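
If a regular expression over the raw string is really what is wanted, one alternative is to run `re.finditer` on the text and keep the character offsets of each hit. With a spaCy Doc in hand, those offsets can then be passed to `doc.char_span(start, end)`, which returns a span when the offsets line up with token boundaries and None otherwise. The sketch below covers only the plain-Python half, so it runs without a model; the variable names and the slightly widened pattern (`\d{1,2}` with word boundaries) are illustrative choices, not part of the original answer.

```python
import re

text = ("Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am "
        "Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight "
        "Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%")

# \b anchors the match on word boundaries; \d{1,2} would also allow "12am".
time_regex = re.compile(r"\b\d{1,2}[ap]m\b")

# Collect (start_char, end_char, matched_text) for every hit.
hits = [(m.start(), m.end(), m.group()) for m in time_regex.finditer(text)]

for start, end, matched in hits:
    print(start, end, matched)
```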


Source: https://stackoverflow.com/questions/57727543/spacys-regex-is-different-to-pythons-regex
