spaCy - lemmatization on pronouns gives some erroneous output

Submitted by 元气小坏坏 on 2020-06-01 06:01:05

Question


Lemmatizing with [token.lemma_ for token in doc] returns -PRON- as the lemma for pronouns. Is this a bug?


Answer 1:


No, this is in fact intended behaviour. See the documentation here:

Unlike verbs and common nouns, there's no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should we normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.

It might be worth noting that this convention may change in the future, as spaCy moves towards better compatibility with the Universal Dependencies format.




Answer 2:


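The following list comprehension may help you get rid of -PRON- in your lemmatized text. It lower-cases every lemma, and wherever the lemma is -PRON- it falls back to the lower-cased original token instead.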

[token.lemma_.lower() if token.lemma_ != '-PRON-' else token.lower_ for token in doc]
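
For anyone who wants to try the idea without loading a spaCy model, here is a minimal sketch of the same fallback wrapped in a function. It only assumes that each token exposes the lemma_ and lower_ attributes used in the one-liner above. The names lemmas_keeping_pronouns and FakeToken are hypothetical and are not part of spaCy; FakeToken is just a stand-in so the snippet runs on its own.

```python
from collections import namedtuple

# Placeholder lemma that spaCy assigns to personal pronouns,
# as described in the documentation quoted in Answer 1.
PRON_LEMMA = "-PRON-"


def lemmas_keeping_pronouns(tokens):
    """Return lower-cased lemmas; for pronouns, keep the lower-cased surface form."""
    return [
        tok.lower_ if tok.lemma_ == PRON_LEMMA else tok.lemma_.lower()
        for tok in tokens
    ]


# Hypothetical stand-in for spaCy tokens (not a spaCy class); it carries
# only the two attributes that the function reads.
FakeToken = namedtuple("FakeToken", ["lower_", "lemma_"])

sample = [
    FakeToken(lower_="she", lemma_="-PRON-"),
    FakeToken(lower_="was", lemma_="be"),
    FakeToken(lower_="running", lemma_="run"),
]

result = lemmas_keeping_pronouns(sample)
print(result)  # ['she', 'be', 'run']
```

With a real pipeline you would pass the doc object itself, for example lemmas_keeping_pronouns(nlp(text)), assuming nlp is a loaded spaCy model. On a version that no longer emits -PRON-, the pronoun branch never fires and the function simply returns the lower-cased lemmas.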


Source: https://stackoverflow.com/questions/50543752/spacy-lemmatization-on-pronouns-gives-some-erronous-output
