NLTK-based stemming and lemmatization

折月煮酒 提交于 2019-12-08 19:26:27

No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer, otherwise, those functions won't know to interpret your string as a sentence (they expect words).

import re

__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweets.lower())
    return ' '.join([lemmatizer.lemmatize(i, 'v') 
                for i in cleaned_tweet.split() if i not in __stop_words])

Alternatively, you can use a PorterStemmer, which does the same thing as lemmatisation, but without context.

from nltk.stem.porter import PorterStemmer  
stemmer = PorterStemmer() 

And, call the stemmer like this:

stemmer.stem(i)

I think this is what you're looking for, but do this prior to calling the lemmatizer as the commenter noted.

>>>import re
>>>s = "This is a beautiful day16~. I am; working on an exercise45.^^^45text34."
>>>s = re.sub(r'[^A-Za-z ]', '', s)
This is a beautiful day I am working on an exercise text
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!