Question
spaCy's documentation has some information on adding new slang terms here.
However, I'd like to know:
(1) When should I call the following function?
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS)
The typical usage of spaCy, according to the introduction guide here, is something like the following:
import spacy
nlp = spacy.load('en')
# Should I call the function add_lookups(...) here?
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
(2) When in the processing pipeline are norm exceptions handled?
I'm assuming a typical pipeline such as: tokenizer -> tagger -> parser -> ner.
Are norm exceptions handled right before the tokenizer? Also, how is the norm exceptions component organized with respect to the other pre-processing components, such as stop words and the lemmatizer (see the full list of components here)? What comes before what?
I'm new to spaCy, and any help would be appreciated. Thanks!
Answer 1:
The norm exceptions are part of the language data, and the attribute getter (the function that takes a text and returns the norm) is initialised with the language class, e.g. English. You can see an example of this here. This all happens before the pipeline is even constructed.
The assumption here is that the norm exceptions are usually language-specific and should thus be defined in the language data, independent of the processing pipeline. Norms are also lexical attributes, so their getters live on the underlying lexeme, the context-insensitive entry in the vocabulary (as opposed to a token, which is the word in context).
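To make that distinction concrete, here is a minimal sketch, assuming spaCy v2.x and that an English model such as en_core_web_sm is installed, which prints the norm as seen on the token and on the underlying lexeme in the vocab:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("cos it is cheaper")

# NORM is a lexical attribute: it is resolved on the lexeme, the
# context-insensitive vocab entry, and exposed on the token as well.
for token in doc:
    print(token.text, token.norm_, nlp.vocab[token.text].norm_)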
However, the nice thing about token.norm_ is that it's writeable – so you can easily add a custom pipeline component that looks up the token's text in your own dictionary and overwrites the norm if necessary:
def add_custom_norms(doc):
    # Overwrite the norm for any token whose text appears in your own
    # lookup table (YOUR_NORM_DICT is a placeholder for that dictionary).
    for token in doc:
        if token.text in YOUR_NORM_DICT:
            token.norm_ = YOUR_NORM_DICT[token.text]
    return doc
nlp.add_pipe(add_custom_norms, last=True)
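For example, a quick usage sketch (YOUR_NORM_DICT here is just a placeholder dictionary of your own slang terms, and the snippet assumes the v2.x add_pipe call above):
YOUR_NORM_DICT = {"cuz": "because", "imo": "in my opinion"}

doc = nlp("imo it works cuz the norms are writeable")
print([(token.text, token.norm_) for token in doc])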
Keep in mind that the NORM attribute is also used as a feature in the model, so depending on the norms you want to add or overwrite, you might want to only apply your custom component after the tagger, parser or entity recognizer is called.
For example, by default, spaCy normalises all currency symbols to "$" to ensure that they all receive similar representations, even if one of them is less frequent in the training data. If your custom component now overwrites "€" with "Euro", this will also have an impact on the model's predictions. So you might see less accurate predictions for MONEY entities.
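As a rough check of that default behaviour (the exact norms depend on the spaCy version and language data):
doc = nlp("The startup raised €2 million and £3 million.")
# With the default English language data, "€" and "£" are expected to
# come out with the norm "$".
print([(token.text, token.norm_) for token in doc])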
If you're planning on training your own model that takes your custom norms into account, you might want to consider implementing a custom language subclass. Alternatively, if you think that the slang terms you want to add should be included in spaCy by default, you can always submit a pull request, for example to the English norm_exceptions.py.
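To tie this back to question (1): in spaCy v2.x, the add_lookups call from the question belongs in the defaults of a language class, not in your runtime code after spacy.load. A rough sketch of such a subclass, where NORM_EXCEPTIONS is a hypothetical dictionary of your own slang terms:
from spacy.attrs import NORM
from spacy.lang.en import English
from spacy.lang.norm_exceptions import BASE_NORMS
from spacy.util import add_lookups

# Hypothetical slang norms to bake into the language data.
NORM_EXCEPTIONS = {"cuz": "because", "imo": "in my opinion"}

class CustomEnglishDefaults(English.Defaults):
    lex_attr_getters = dict(English.Defaults.lex_attr_getters)
    # Lookups listed first take priority, so the custom exceptions
    # override the base norms here.
    lex_attr_getters[NORM] = add_lookups(
        English.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
    )

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

nlp = CustomEnglish()
doc = nlp("imo it works")
print([(token.text, token.norm_) for token in doc])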
Source: https://stackoverflow.com/questions/49493232/how-to-add-custom-slangs-into-spacys-norm-exceptions-py-module