Question
SpaCy Version: 2.0.11
Python Version: 3.6.5
OS: Ubuntu 16.04
My Sentence Samples:
Marketing-Representative- won't die in car accident.
or
Out-of-box implementation
Expected Tokens:
["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]
["Out-of-box", "implementation"]
SpaCy Tokens (Default Tokenizer):
["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]
["Out", "-", "of", "-", "box", "implementation"]
I tried creating a custom tokenizer, but it doesn't handle all the edge cases that spaCy covers via tokenizer_exceptions (code below):
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')

prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
# Custom infix pattern: only split on these punctuation characters.
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("Marketing-Representative- won't die in car accident.")
for token in doc:
    print(token.text)
Output:
Marketing-Representative-
won
'
t
die
in
car
accident
.
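One thing I noticed: the won / ' / t split above seems to happen because my custom Tokenizer is built without the default exception rules. Passing them back in (I'm assuming here that spaCy 2.x's Tokenizer accepts a rules argument for tokenizer exceptions) keeps special cases like "won't" intact:
def custom_tokenizer(nlp):
    # rules= restores the default special cases (e.g. "won't" -> "wo" + "n't");
    # that this works is my assumption about the Tokenizer signature.
    return Tokenizer(nlp.vocab,
                     rules=nlp.Defaults.tokenizer_exceptions,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
But this doesn't fix the hyphen splitting itself, which is the main problem.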
I need someone to guide me towards the appropriate way of doing this. Changing the regex above might do it, or some other method might work entirely. I also tried spaCy's rule-based Matcher, but I wasn't able to create a rule that handles hyphens between more than two words (e.g. "out-of-box") so that a Matcher could be used with span.merge() (my attempt is sketched below).
Either way, I need words containing intra-word hyphens to become single tokens, the way Stanford CoreNLP handles them.
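The closest I got with the Matcher was adding one pattern per hyphen count, which obviously doesn't generalize to arbitrarily long compounds. A sketch of that attempt (the 'HYPHENATED' label and test string are mine):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# One pattern per hyphen count (the 2.0 Matcher has no grouped repetition).
matcher.add('HYPHENATED', None,
            [{'IS_ALPHA': True}, {'ORTH': '-'}, {'IS_ALPHA': True}],
            [{'IS_ALPHA': True}, {'ORTH': '-'}, {'IS_ALPHA': True},
             {'ORTH': '-'}, {'IS_ALPHA': True}])

doc = nlp("Out-of-box implementation")
# Keep only the longest match at each start position, then merge from the
# end of the doc backwards so earlier token indices stay valid.
spans = sorted((doc[s:e] for _, s, e in matcher(doc)),
               key=lambda sp: (sp.start, -len(sp)))
filtered, last_end = [], -1
for sp in spans:
    if sp.start >= last_end:
        filtered.append(sp)
        last_end = sp.end
for sp in reversed(filtered):
    sp.merge()
print([t.text for t in doc])  # ['Out-of-box', 'implementation']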
Answer 1:
Although this isn't documented on the spaCy usage site, it looks like we just need to add a regex for the *fix we are working with, in this case infix. Also, it appears we can extend nlp.Defaults.prefixes with custom regexes:
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
This gives the desired result. There is no need to set the prefix and suffix defaults, since we are not working with those.
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en')

# Reuse the default prefix patterns as infixes, plus a few additions.
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    # Only infix_finditer is customized; no prefix/suffix splitting
    # is applied by this tokenizer.
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"
for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])
Result
$python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']
You may want to refine the added regexes to make them more robust for other kinds of tokens that are close to the patterns applied here.
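For instance, reusing the nlp object from the code above, a few extra strings (my own, not from the question) show how the looser infix set behaves on nearby cases:
# Quick sanity checks on nearby cases before relying on this tokenizer.
for s in ("A state-of-the-art model",  # longer hyphenated compound stays whole
          "pages 10-12",               # numeric range also stays a single token
          "see notes/appendix"):       # the [./] infix still splits the slash
    print([t.text for t in nlp(s)])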
Source: https://stackoverflow.com/questions/52293874/why-does-spacy-not-preserve-intra-word-hyphens-during-tokenization-like-stanford