How to stem a pandas DataFrame using nltk? The output should be a stemmed DataFrame

Submitted by 放荡痞女 on 2021-01-07 03:11:56

Question


I'm trying to pre-process a dataset that contains text data, and I have created a pandas DataFrame from it. How can I apply stemming to the DataFrame and get a stemmed DataFrame as output?


Answer 1:


Given a pandas DataFrame df, you can stem its contents by tokenizing the text in each cell and applying a stemming function to every token, across the whole DataFrame.

As an example, I used the Snowball stemmer from nltk:

from nltk.stem.snowball import SnowballStemmer
englishStemmer = SnowballStemmer("english")  # create an English Snowball stemmer instance

And this tokenizer:

from nltk.tokenize import WhitespaceTokenizer
w_tokenizer = WhitespaceTokenizer()  # tokenize() is an instance method, so create an instance

Define your function:

def stemm_texts(text):
    return [englishStemmer.stem(w) for w in w_tokenizer.tokenize(str(text))]

Apply the function on your df:

df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))

Note that na_action='ignore' makes map skip NaN values, so missing cells stay NaN instead of being converted to the string 'nan' and stemmed.
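
To see what the steps so far produce, here is a minimal end-to-end sketch. The DataFrame `sample`, its column names, and the names `stemmer`, `tokenizer`, and `stem_tokens` are made up for illustration and are not from the original answer. It assumes nltk and pandas are installed; the Snowball stemmer and the whitespace tokenizer should not need any extra nltk data downloads.

```python
import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()  # an instance, not the class


def stem_tokens(text):
    """Split on whitespace and stem each token; returns a list of stems."""
    return [stemmer.stem(w) for w in tokenizer.tokenize(str(text))]


# Hypothetical sample data, including one missing value.
sample = pd.DataFrame({
    "title": ["running cats", "dogs jumped"],
    "body": ["cats jumps", np.nan],
})

# Map the function over every column; NaN cells are skipped.
stemmed = sample.apply(lambda col: col.map(stem_tokens, na_action="ignore"))
print(stemmed)
```

Each non-missing cell now holds a list of stems rather than a string, which is why a detokenizing step is usually wanted afterwards.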

After this step each cell holds a list of stemmed tokens. You might want to detokenize again to get plain strings back:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()
df = df.apply(lambda y: y.map(detokenizer.detokenize, na_action='ignore'))
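
If the DataFrame also contains non-text columns, applying the function to every column would turn numbers into strings because of the str(text) call. One possible variant, sketched below, stems and rejoins in a single pass and touches only the columns you name. Note the substitutions: it uses plain `str.split()` and `" ".join()` in place of WhitespaceTokenizer and TreebankWordDetokenizer, which should behave much the same for whitespace-separated text. The helpers `stem_text` and `stem_columns` and the `reviews` data are hypothetical.

```python
import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_text(text):
    """Stem every whitespace-separated word and return a single string."""
    return " ".join(stemmer.stem(w) for w in str(text).split())


def stem_columns(df, columns):
    """Return a copy of df with the named text columns stemmed; NaN is kept."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(stem_text, na_action="ignore")
    return out


# Hypothetical data with one text column and one numeric column.
reviews = pd.DataFrame({
    "text": ["running cats", np.nan, "dogs jumped"],
    "rating": [5, 3, 4],
})

result = stem_columns(reviews, ["text"])
print(result)
```

Working on a copy leaves the original DataFrame unchanged, which makes it easy to compare raw and stemmed text side by side.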


Source: https://stackoverflow.com/questions/55482342/how-to-stem-a-pandas-dataframe-using-nltk-the-output-should-be-a-stemmed-dataf
