Question
When an RRB (right round bracket) is not separated from the following word by a space, it is recognized as part of that word.
In [34]: nlp("Indonesia (CNN)AirAsia ")
Out[34]: Indonesia (CNN)AirAsia
In [35]: d=nlp("Indonesia (CNN)AirAsia ")
In [36]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]
Out[36]:
[('Indonesia', 'Indonesia', 'PROPN', 'NNP'),
('(', '(', 'PUNCT', '-LRB-'),
('CNN)AirAsia', 'CNN)AirAsia', 'PROPN', 'NNP')]
In [39]: d=nlp("(CNN)Police")
In [40]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]
Out[40]: [('(', '(', 'PUNCT', '-LRB-'), ('CNN)Police', 'cnn)police', 'VERB', 'VB')]
The expected result is:
In [37]: d=nlp("(CNN) Police")
In [38]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]
Out[38]:
[('(', '(', 'PUNCT', '-LRB-'),
('CNN', 'CNN', 'PROPN', 'NNP'),
(')', ')', 'PUNCT', '-RRB-'),
('Police', 'Police', 'NOUN', 'NNS')]
Is this a bug? Any suggestions to fix the issue?
Answer 1:
Use a custom tokenizer to add the rule r"\b\)\b" to the infixes. The regex matches a ) that is both preceded and followed by a word character (a letter, digit, or _; in Python 3, other Unicode word characters as well).
You may customize this regex further; a lot depends on the contexts in which you want the ) to be split.
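For instance, here is a minimal sketch (plain re, no spaCy needed) of how the rule behaves, plus a hypothetical broader variant that also splits a ( glued between word characters:

import re

# The answer's rule: ")" flanked by word characters on both sides.
infix_rrb = re.compile(r"\b\)\b")
print(infix_rrb.findall("Indonesia (CNN)AirAsia "))  # [')'] -> would be split
print(infix_rrb.findall("(CNN) Police"))             # []   -> already spaced, left alone

# Hypothetical variant (not from the answer): also split "(" when it sits
# between word characters, e.g. "AirAsia(CNN)".
infix_both = re.compile(r"\b[()]\b")
print(infix_both.findall("AirAsia(CNN)Police"))      # ['(', ')']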
See the full Python demo:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(nlp):
    # Prepend the word-boundary ")" rule to the default infix patterns
    infixes = (r"\b\)\b",) + nlp.Defaults.infixes
    infix_re = compile_infix_regex(infixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Indonesia (CNN)AirAsia ")
print([(t.text, t.lemma_, t.pos_, t.tag_) for t in doc])
Output:
[('Indonesia', 'Indonesia', 'PROPN', 'NNP'), ('(', '(', 'PUNCT', '-LRB-'), ('CNN', 'CNN', 'PROPN', 'NNP'), (')', ')', 'PUNCT', '-RRB-'), ('AirAsia', 'AirAsia', 'PROPN', 'NNP')]
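With the same custom tokenizer, the second example from the question splits as expected too. The token boundaries below are fully determined by the infix rule; the tags may still vary by model, so only the texts are shown:

doc = nlp("(CNN)Police")
print([t.text for t in doc])
# ['(', 'CNN', ')', 'Police']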
Answer 2:
An alternative solution that does not require a custom tokenizer:
import spacy
from spacy.lang.char_classes import (ALPHA, ALPHA_LOWER, ALPHA_UPPER,
                                     CONCAT_QUOTES, HYPHENS,
                                     LIST_ELLIPSES, LIST_ICONS)
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        # Additions to the infix rules begin here:
        # split on a closing bracket between word characters
        r"\b\)\b",
    ]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
Then save this model to disk and use it as the base model when training your new model, as sketched below.
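Continuing from the snippet above, a minimal sketch of that step (the paths are placeholders; the tokenizer's infix pattern is serialized with the model, so the custom rule survives the round trip):

nlp.to_disk('./base_model')        # './base_model' is a placeholder path
nlp2 = spacy.load('./base_model')  # the custom infix rule is restored
print([t.text for t in nlp2("Indonesia (CNN)AirAsia")])
# ['Indonesia', '(', 'CNN', ')', 'AirAsia']

# The saved directory can then be passed to training, e.g. with spaCy v2's
# CLI (assuming the --base-model option is available in your version):
#   python -m spacy train en ./output train.json dev.json --base-model ./base_model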
Source: https://stackoverflow.com/questions/56439423/spacy-parenthesis-tokenization-pairs-of-lrb-rrb-not-tokenized-correctly