spaCy lemmatizer issue/consistency

Posted by 徘徊边缘 on 2021-02-11 18:21:09

Question


I'm currently using spaCy for NLP purposes (mainly lemmatization and tokenization). The model used is en_core_web_sm (2.1.0).

The following code is run to retrieve a list of "cleansed" words from a query:

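import spacy

# Load the small English model and run the pipeline on the query
nlp = spacy.load("en_core_web_sm")
query = "processing of tea leaves"
doc = nlp(query)

# Collect the lemma of every token that is not a single space
list_words = []
for token in doc:
    if token.text != ' ':
        list_words.append(token.lemma_)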

However, I face a major issue when running this code. For example, when the query is "processing of tea leaves", the result stored in list_words can be either ['processing', 'tea', 'leaf'] or ['processing', 'tea', 'leave'].

It seems that the result is not consistent. I cannot change my input/query (adding another word for context is not possible) and I really need to get the same result every time. I think the loading of the model may be the issue.

Why does the result differ? Can I load the model the "same" way every time? Did I miss a parameter that would give the same result for an ambiguous query?

Thanks for your help


Answer 1:


The issue was analysed by the spaCy team and they've come up with a solution. Here's the fix: https://github.com/explosion/spaCy/pull/3646

Basically, when the lemmatization rules were applied, the candidate lemmas were collected in a set and one of them was returned. A set has no guaranteed ordering, and Python randomizes string hashes per process by default, so the returned lemma could change between Python sessions.
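
To see the effect outside spaCy, here is a minimal sketch that prints the iteration order of a small set of strings in fresh interpreter processes. It does not touch spaCy at all; the helper name set_order is made up for illustration. It relies on the standard PYTHONHASHSEED environment variable, which pins Python's string hashing: the same seed should always give the same order, while different seeds may or may not swap the two words.

```python
import os
import subprocess
import sys

# Tiny program run in a fresh interpreter: print the iteration order of a set.
SNIPPET = "print(list({'leave', 'leaf'}))"


def set_order(hash_seed):
    """Return the set's iteration order in a new Python process.

    hash_seed is passed through PYTHONHASHSEED, so string hashing (and
    therefore the set's internal layout) is fixed for that process.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same seed: same order on every run. Different seeds: the order may change.
for seed in range(5):
    print(seed, set_order(seed))
```

On an affected spaCy version, launching the script with a fixed PYTHONHASHSEED would presumably make the lemma stable as well, but that is an assumption and a workaround rather than a fix; upgrading to a release that contains the pull request above is the cleaner option.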


For example in my case, for the noun "leaves", the potential lemmas were "leave" and "leaf". Without ordering, the result was random - it could be "leave" or "leaf".
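
The general remedy is to never pick an element from an unordered set. The sketch below shows two common ways to do that in plain Python; it is not spaCy's code and not necessarily what the pull request does, and the names unique_in_order and pick_lemma are hypothetical.

```python
def unique_in_order(forms):
    """Drop duplicates while keeping the order in which forms were produced.

    dict keys preserve insertion order (Python 3.7+), unlike set elements.
    """
    return list(dict.fromkeys(forms))


def pick_lemma(candidates):
    """Pick one lemma from any collection of candidates, reproducibly.

    Sorting makes the choice independent of set iteration order. The
    alphabetical rule is arbitrary; it only guarantees consistency,
    not linguistic correctness.
    """
    if not candidates:
        raise ValueError("no candidate lemmas")
    return sorted(candidates)[0]


# Candidates as a rule-based lemmatizer might emit them, with a repeat.
forms = ["leave", "leaf", "leave"]
print(unique_in_order(forms))        # ['leave', 'leaf']
print(pick_lemma({"leave", "leaf"}))  # leaf
```

Either approach returns the same lemma in every session; which lemma is the right one for "leaves" without context is a separate question that ordering alone cannot answer.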



Source: https://stackoverflow.com/questions/55864933/spacy-lemmatizer-issue-consistency
