Question
I am trying to get the lemmatized version of a single word. Is there a way to do this using spaCy (the fantastic Python NLP library)?
Below is the code I have tried, but it does not work:
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()  # no lemma tables are loaded into this Lookups object
lemmatizer = Lemmatizer(lookups)
word = "ducks"
lemmas = lemmatizer.lookup(word)  # with empty lookups, this just returns the input word
print(lemmas)
The result I was hoping for was that the word "ducks" (plural) would result in "duck" (singular). Unfortunately, "ducks" (plural) is returned.
Is there a way of doing this?
NOTE: I realize that I could process an entire string of words from a document (nlp(document)), find the required token, and then get its lemma (token.lemma_), but the words I need to lemmatize are somewhat dynamic and cannot be processed as one large document.
Answer 1:
If you want to lemmatize a single token, try the simplified text-processing library TextBlob:
from textblob import Word

# Lemmatize a single word
w = Word('ducks')
print(w.lemmatize())
Output
> duck
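Word.lemmatize also accepts a WordNet part-of-speech tag ('n', 'v', 'a', 'r') when the default noun reading is not what you want. TextBlob relies on NLTK's WordNet data under the hood, so you may need to fetch the corpora first (python -m textblob.download_corpora). A minimal sketch:

from textblob import Word

# The default POS is noun; pass 'v' to lemmatize as a verb
print(Word('ducks').lemmatize())       # duck
print(Word('running').lemmatize('v'))  # run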
Or NLTK's SnowballStemmer (note that this is a stemmer, not a true lemmatizer):
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('ducks'))
Output
> duck
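Keep in mind that the Snowball stemmer chops suffixes by rule rather than looking up dictionary forms, so it can emit non-words where a lemmatizer would not. A quick illustration:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('ducks'))   # duck
print(stemmer.stem('ponies'))  # poni  (a stem, not a dictionary word)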
Otherwise you can keep using spaCy, but with the parser and NER pipeline components disabled:
- Start by downloading a 12 MB small model (English multi-task CNN trained on OntoNotes):
$ python -m spacy download en_core_web_sm
- Python code
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # keep the tagger, which is needed for lemmatization
print(" ".join(token.lemma_ for token in nlp('ducks')))
Output
> duck
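Since the asker's words arrive one at a time, it may help to load the pipeline once and wrap the lookup in a small function. This is just a sketch (the helper name lemmatize_word is made up here), assuming en_core_web_sm is installed:

import spacy

# Load once at startup; reloading the model for every word would be very slow
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_word(word):
    """Return the lemma of a single word via the spaCy pipeline."""
    return nlp(word)[0].lemma_

print(lemmatize_word('ducks'))  # duck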
Answer 2:
I think you are missing the part where you use the spaCy language model as the reference for lemmatization. If you look at the modifications I made to your code below, along with the output provided, you can see that duck is the proper lemma_ for ducks.
import spacy

word = "ducks"

# Load the spaCy core English model
nlp = spacy.load('en_core_web_sm')

# Run the NLP pipeline on the input
doc = nlp(word)

# Print formatted token attributes
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))
Output
Token Attributes:
token.text, token.pos_, token.tag_, token.dep_, token.lemma_
ducks       NOUN        NNS         ROOT        duck
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb and is lemmatized to confuse.
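You can see this in spaCy itself; a minimal sketch (the exact tags depend on the model, here assumed to be en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')

for text in ("This is confusing", "I was confusing you with someone else"):
    doc = nlp(text)
    for token in doc:
        if token.text == "confusing":
            print(token.text, token.pos_, token.lemma_)

# Expected output (model-dependent):
# confusing ADJ confusing
# confusing VERB confuse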
If you want tokens with different parts of speech to be mapped to the same lemma, you can instead use a stemming algorithm such as Porter stemming, which you simply apply to each token.
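NLTK ships a Python implementation of the Porter algorithm, so a minimal sketch:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Both the adjectival and the verbal uses of "confusing" collapse to the same stem
print(stemmer.stem('confusing'))  # confus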
Answer 3:
With NLTK, simply:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('ducks')
'duck'
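Note that WordNetLemmatizer treats every word as a noun unless you pass a WordNet POS tag, and it needs the wordnet corpus (nltk.download('wordnet')) on first use:

>>> wnl.lemmatize('confusing')       # default pos='n' leaves it unchanged
'confusing'
>>> wnl.lemmatize('confusing', 'v')  # as a verb
'confuse'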
Source: https://stackoverflow.com/questions/59636002/spacy-lemmatization-of-a-single-word