Why does spaCy not preserve intra-word hyphens during tokenization like Stanford CoreNLP does?

Submitted by 扶醉桌前 on 2019-11-29 16:40:09

Although this is not documented on the spaCy usage site,

it looks like we just need to add a regex for the *fix we are working with, in this case the infix.
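To see what an infix finditer does conceptually, here is a minimal pure-`re` sketch (an illustration, not spaCy's actual implementation): a token is split wherever the infix pattern matches inside it, and the matched infix becomes a token of its own.

```python
import re

def split_on_infixes(token, infix_re):
    """Toy infix splitter: cut the token at each infix match,
    keeping the matched separator as its own token."""
    tokens, start = [], 0
    for m in infix_re.finditer(token):
        if m.start() > start:
            tokens.append(token[start:m.start()])
        tokens.append(m.group())  # the infix itself becomes a token
        start = m.end()
    if start < len(token):
        tokens.append(token[start:])
    return tokens

# With a hyphen listed as an infix, intra-word hyphens split the word:
hyphen_infix = re.compile(r"(?<=[a-zA-Z])-(?=[a-zA-Z])")
print(split_on_infixes("Out-of-box", hyphen_infix))
# ['Out', '-', 'of', '-', 'box']
```

This is exactly why spaCy's default tokenizer splits hyphenated words: its default infix patterns include a hyphen rule. Replacing the infix patterns with a set that omits hyphens, as below, keeps such words whole.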

Also, it appears we can extend nlp.Defaults.prefixes with custom regexes:

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

This will give you the desired result. There is no need to set defaults for the prefixes and suffixes, since we are not working with those.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en')  # spaCy 2.x shortcut for the English model

# Reuse the default prefix patterns as infixes, plus three custom ones:
# [./] re-splits periods and slashes, (.'.) re-splits contractions
# like "n't". Hyphens are deliberately not listed.
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    # Only infix_finditer is overridden; since no hyphen pattern is
    # among the infixes, intra-word hyphens stay intact.
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])

Result

$ python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  

You may want to refine the added regexes to make them more robust for other kinds of tokens that resemble the patterns used.
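As one hedged example of such a refinement (my suggestion, not part of the answer above): anchoring a pattern with lookarounds makes it far less likely to fire on tokens you want left whole. The sketch below splits a hyphen only when it sits between digits, so date ranges would split while hyphenated words stay intact.

```python
import re

# Hypothetical refinement: treat '-' as an infix only between digits,
# so "2019-11-29" would be split but "Out-of-box" would not.
digit_hyphen = re.compile(r"(?<=[0-9])-(?=[0-9])")

print(bool(digit_hyphen.search("2019-11-29")))  # True: would be split
print(bool(digit_hyphen.search("Out-of-box")))  # False: left intact
```

A pattern like this could be added to the infixes tuple in place of, or alongside, the broader patterns shown above.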
