How to add custom slang terms to spaCy's norm_exceptions.py module?

Posted by 匆匆过客 on 2019-12-06 12:11:15

The norm exceptions are part of the language data, and the attribute getter (the function that takes a text and returns the norm) is initialised with the language class, e.g. English. You can see an example of this here. This all happens before the pipeline is even constructed.

The assumption here is that the norm exceptions are usually language-specific and should thus be defined in the language data, independent of the processing pipeline. Norms are also lexical attributes, so their getters live on the underlying lexeme, the context-insensitive entry in the vocabulary (as opposed to a token, which is the word in context).
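
To make the idea of a context-insensitive getter concrete, here is a minimal pure-Python sketch. It is not spaCy's actual implementation: the names `make_norm_getter`, `BASE_EXCEPTIONS` and `SLANG_EXCEPTIONS` and their entries are hypothetical, and the sketch assumes that the default norm is simply the lowercased text and that exception tables are checked in order.

```python
def make_norm_getter(*exception_tables):
    """Return a function that maps a string to its norm (hypothetical helper)."""
    def get_norm(text):
        # The first table that contains the exact text wins
        for table in exception_tables:
            if text in table:
                return table[text]
        # Assumed default: the lowercased text
        return text.lower()
    return get_norm


BASE_EXCEPTIONS = {"€": "$", "£": "$"}  # illustrative entries
SLANG_EXCEPTIONS = {"gonna": "going to", "cuz": "because"}  # illustrative entries

get_norm = make_norm_getter(SLANG_EXCEPTIONS, BASE_EXCEPTIONS)
norms = [get_norm(word) for word in ["Cuz", "cuz", "€", "Hello"]]
print(norms)
```

The getter sees only the string, never the surrounding sentence, which is why such a lookup can live on the lexeme rather than on the token. Note that "Cuz" falls through to the lowercased default in this sketch because the table lookup is case-sensitive.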

However, the nice thing about token.norm_ is that it's writeable, so you can easily add a custom pipeline component that looks up the token's text in your own dictionary and overwrites the norm if necessary:

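import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
YOUR_NORM_DICT = {"lol": "laughing out loud"}  # placeholder: your own mapping

def add_custom_norms(doc):
    # Overwrite the norm of every token whose exact text is in the dictionary
    for token in doc:
        if token.text in YOUR_NORM_DICT:
            token.norm_ = YOUR_NORM_DICT[token.text]
    return doc

# spaCy v2 API; v3 expects the name of a registered component instead
nlp.add_pipe(add_custom_norms, last=True)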
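
Because the lookup above matches `token.text` exactly, "LOL" and "lol" would need separate dictionary entries. One possible variation, shown here only as a sketch, falls back to the lowercased text. To keep the example runnable without a trained model, it uses a hypothetical `FakeToken` stand-in that has just `text` and `norm_` attributes; `SLANG_NORMS` and its entries are likewise made up for illustration.

```python
class FakeToken:
    """Stand-in for a spaCy token (hypothetical, for illustration only)."""

    def __init__(self, text):
        self.text = text
        self.norm_ = text.lower()  # assumed default norm


SLANG_NORMS = {"lol": "laughing out loud", "brb": "be right back"}


def add_custom_norms(doc):
    for token in doc:
        # Try the exact text first, then the lowercased text
        for key in (token.text, token.text.lower()):
            if key in SLANG_NORMS:
                token.norm_ = SLANG_NORMS[key]
                break
    return doc


doc = add_custom_norms([FakeToken(word) for word in "LOL I will brb".split()])
result = [token.norm_ for token in doc]
print(result)
```

The function only reads `token.text` and writes `token.norm_`, so the same body should work unchanged as a component on real documents.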

Keep in mind that the NORM attribute is also used as a feature in the model, so depending on the norms you want to add or overwrite, you might want to only apply your custom component after the tagger, parser or entity recognizer is called.

For example, by default, spaCy normalises all currency symbols to "$" to ensure that they all receive similar representations, even if one of them is less frequent in the training data. If your custom component now overwrites "€" with "Euro", this will also have an impact on the model's predictions. So you might see less accurate predictions for MONEY entities.
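
The effect of component order can be illustrated with a toy pipeline in plain Python. Nothing here is spaCy code: `toy_entity_recognizer` is a hypothetical stand-in for a model that only learned "$" as the norm of currency symbols, and tokens are plain dictionaries.

```python
DEFAULT_NORMS = {"€": "$", "£": "$"}  # currency symbols share one norm
CUSTOM_NORMS = {"€": "Euro"}          # the custom overwrite discussed above


def make_doc(text):
    # Each toy token stores its text, its default norm and an entity label
    return [{"text": w, "norm": DEFAULT_NORMS.get(w, w.lower()), "ent": ""}
            for w in text.split()]


def custom_norms(doc):
    for tok in doc:
        if tok["text"] in CUSTOM_NORMS:
            tok["norm"] = CUSTOM_NORMS[tok["text"]]
    return doc


def toy_entity_recognizer(doc):
    # Stand-in model: it only recognises the "$" norm as MONEY
    for tok in doc:
        if tok["norm"] == "$":
            tok["ent"] = "MONEY"
    return doc


def run(text, pipeline):
    doc = make_doc(text)
    for component in pipeline:
        doc = component(doc)
    return [(tok["norm"], tok["ent"]) for tok in doc]


norms_first = run("€ 20", [custom_norms, toy_entity_recognizer])
norms_last = run("€ 20", [toy_entity_recognizer, custom_norms])
print(norms_first)
print(norms_last)
```

When the custom norms run first, the toy recognizer no longer sees "$" and assigns no label; when they run last, the label is assigned before the norm is overwritten. A real statistical model degrades less abruptly than this toy, but the ordering concern is the same.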

If you're planning on training your own model that takes your custom norms into account, you might want to consider implementing a custom language subclass. Alternatively, if you think that the slang terms you want to add should be included in spaCy by default, you can always submit a pull request, for example to the English norm_exceptions.py.
