Question
Lemmatizing with spaCy via [token.lemma_ for token in doc] returns -PRON- as the lemma for pronouns. Is this a bug?
Answer 1:
No, this is in fact intended behaviour. See the documentation here:
Unlike verbs and common nouns, there's no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should we normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.
It might be worth noting that this convention may change in the future, as spaCy moves towards better compatibility with the Universal Dependencies format.
Answer 2:
The following snippet may help you remove -PRON- from your lemmatized text. It lowercases every lemma and, where the lemma is -PRON-, falls back to the lowercased original token:
[token.lemma_.lower() if token.lemma_ != '-PRON-' else token.lower_ for token in doc]
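The same idea can be wrapped in a small reusable function. This is only a sketch: the names `lemmas_without_pron`, `PRON_PLACEHOLDER` and `FakeToken` are made up for illustration and are not part of spaCy. `FakeToken` is a stand-in that mimics the two token attributes used above (`lemma_` and `lower_`) so the example runs without a spaCy model installed.

```python
from collections import namedtuple

# Placeholder lemma described in the answer above.
PRON_PLACEHOLDER = "-PRON-"


def lemmas_without_pron(tokens, placeholder=PRON_PLACEHOLDER):
    """Return lowercased lemmas, using the lowercased surface form
    wherever the lemma equals the placeholder."""
    return [
        tok.lower_ if tok.lemma_ == placeholder else tok.lemma_.lower()
        for tok in tokens
    ]


# Hypothetical stand-in for spaCy tokens: only the two attributes
# the function reads are modelled here.
FakeToken = namedtuple("FakeToken", ["lemma_", "lower_"])

sample = [
    FakeToken("-PRON-", "she"),
    FakeToken("be", "was"),
    FakeToken("read", "reading"),
]

print(lemmas_without_pron(sample))  # ['she', 'be', 'read']
```

With real spaCy objects you would presumably pass the `doc` itself, e.g. `lemmas_without_pron(doc)`, since the function only iterates over tokens and reads `lemma_` and `lower_`. If the installed spaCy version does not emit the placeholder, the function simply returns the lowercased lemmas.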
Source: https://stackoverflow.com/questions/50543752/spacy-lemmatization-on-pronouns-gives-some-erronous-output