Question
I am trying to get the lemmatized version of a single word. Is there a way to do this using spaCy (the fantastic Python NLP library)?
Below is the code I have tried, but it does not work:
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()  # no lemma tables are loaded into this Lookups object
lemmatizer = Lemmatizer(lookups)
word = "ducks"
lemmas = lemmatizer.lookup(word)  # with empty lookups, this just returns the input word
print(lemmas)
The result I was hoping for was that the word "ducks" (plural) would result in "duck" (singular). Unfortunately, "ducks" (plural) is returned.
Is there a way of doing this?
NOTE: I realize that I could process an entire string of words from a document (nlp(document)), find the required token, and then get its lemma (token.lemma_), but the words I need to lemmatize are somewhat dynamic and cannot be processed as one large document.
Answer 1:
If you want to lemmatize a single token, try the simplified text-processing library TextBlob:
from textblob import Word

# Lemmatize a single word
w = Word('ducks')
print(w.lemmatize())
Output
> duck
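Word.lemmatize also accepts a WordNet part-of-speech tag ('n', 'v', 'a', 'r') when the default noun reading is not what you want. TextBlob relies on NLTK's WordNet data under the hood, so you may need to fetch the corpora first (python -m textblob.download_corpora). A minimal sketch:

from textblob import Word

# The default POS is noun; pass 'v' to lemmatize as a verb
print(Word('ducks').lemmatize())       # duck
print(Word('running').lemmatize('v'))  # run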
Or NLTK's SnowballStemmer (note that this is a stemmer, not a true lemmatizer):
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('ducks'))
Output
> duck
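Keep in mind that the Snowball stemmer chops suffixes by rule rather than looking up dictionary forms, so it can emit non-words where a lemmatizer would not. A quick illustration:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('ducks'))   # duck
print(stemmer.stem('ponies'))  # poni  (a stem, not a dictionary word)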
Otherwise you can keep using spaCy, but with the parser and NER pipeline components disabled:
- Start by downloading a 12 MB small model (English multi-task CNN trained on OntoNotes):
$ python -m spacy download en_core_web_sm
- Python code
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # keep the tagger, which is needed for lemmatization
print(" ".join(token.lemma_ for token in nlp('ducks')))
Output
> duck
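Since the asker's words arrive one at a time, it may help to load the pipeline once and wrap the lookup in a small function. This is just a sketch (the helper name lemmatize_word is made up here), assuming en_core_web_sm is installed:

import spacy

# Load once at startup; reloading the model for every word would be very slow
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_word(word):
    """Return the lemma of a single word via the spaCy pipeline."""
    return nlp(word)[0].lemma_

print(lemmatize_word('ducks'))  # duck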
Answer 2:
I think you are missing the part where you use the spaCy language model as the reference for lemmatization. If you look at the modifications I made to your code below, along with the output provided, you can see that duck is the proper lemma_ for ducks.
import spacy

word = "ducks"

# Load the spaCy core English model
nlp = spacy.load('en_core_web_sm')

# Run the NLP pipeline on the input
doc = nlp(word)

# Print formatted token attributes
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))
Output
Token Attributes:
token.text, token.pos_, token.tag_, token.dep_, token.lemma_
ducks       NOUN        NNS         ROOT        duck
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb and is lemmatized to confuse.
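You can see this in spaCy itself; a minimal sketch (the exact tags depend on the model, here assumed to be en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')

for text in ("This is confusing", "I was confusing you with someone else"):
    doc = nlp(text)
    for token in doc:
        if token.text == "confusing":
            print(token.text, token.pos_, token.lemma_)

# Expected output (model-dependent):
# confusing ADJ confusing
# confusing VERB confuse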
If you want tokens with different parts of speech to be mapped to the same lemma, you can instead use a stemming algorithm such as Porter stemming, which you simply apply to each token.
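NLTK ships a Python implementation of the Porter algorithm, so a minimal sketch:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Both the adjectival and the verbal uses of "confusing" collapse to the same stem
print(stemmer.stem('confusing'))  # confus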
Answer 3:
With NLTK, simply:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('ducks')
'duck'
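Note that WordNetLemmatizer treats every word as a noun unless you pass a WordNet POS tag, and it needs the wordnet corpus (nltk.download('wordnet')) on first use:

>>> wnl.lemmatize('confusing')       # default pos='n' leaves it unchanged
'confusing'
>>> wnl.lemmatize('confusing', 'v')  # as a verb
'confuse'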
Source: https://stackoverflow.com/questions/59636002/spacy-lemmatization-of-a-single-word