Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

I want to include hyphen words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on Stackoverflow, Github, its documentation and elsewhere, I also wrote a custom tokenizer as below.

import re
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

So for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.'

Now, the tokens after incorporating the custom Spacy Tokenizer are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

Earlier, the tokens before this change were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And, the expected tokens should be:

As one can see the hyphen word is included and so are the other punctuation marks except for the double quotes and apostrophe. But now, the apostrophe and double quotes are not having the earlier or expected behaviour. I have tried different permutations and combinations for the regex compile for the Infix but no progress to fix this issue. Hence, any help would be highly appreciated.

Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']

If you want to dig into to why your regexes weren't working like SpaCy's, here are links to the relevant source code:

Prefixes and suffixes defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

With reference to characters (e.g, quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g., compile_prefix_regex):

https://github.com/explosion/spaCy/blob/master/spacy/util.py

来源：https://stackoverflow.com/questions/51012476/spacy-custom-tokenizer-to-include-only-hyphen-words-as-tokens-using-infix-regex

标签

regex