Lemmatization of all pandas cells

暖寄归人 2021-01-02 08:51

I have a pandas DataFrame with one column, let's call it 'col'. Each entry of this column is a list of words: ['word1', 'word2', etc.]

How can I efficient

2 answers

夕颜 (original poster)

2021-01-02 09:29
You can use apply from pandas with a function that lemmatizes each word in a given string. Note that there are many ways to tokenize your text. If you use a whitespace tokenizer, you might have to remove symbols such as '.' first, because they stay attached to the neighbouring word.

Below is an example of how to lemmatize a column of a sample DataFrame.
```
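import nltk
import pandas as pd

# One-time download of the WordNet data used by the lemmatizer:
# nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace, then lemmatize each token (treated as a noun by default)
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
# Each row of the new column holds the list of lemmatized tokens
df['text_lemmatized'] = df.text.apply(lemmatize_text)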
```
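The question describes a column whose entries are already lists of words, so the tokenizing step can be skipped. A minimal sketch of that variant follows. The helper name `lemmatize_list_column`, the `_lemmatized` suffix, and the per-word cache are assumptions made for illustration and are not part of the original answer. The cache only helps when the same words recur across many rows.

```python
import pandas as pd

def lemmatize_list_column(df, col, lemmatize, out_col=None):
    """Return a copy of df with a new column of lemmatized word lists.

    `lemmatize` is any callable mapping one word to its lemma,
    e.g. nltk.stem.WordNetLemmatizer().lemmatize (hypothetical helper,
    not from the original answer).
    """
    out_col = out_col or f"{col}_lemmatized"
    cache = {}  # word -> lemma, so each distinct word is lemmatized once

    def cached(word):
        if word not in cache:
            cache[word] = lemmatize(word)
        return cache[word]

    result = df.copy()
    # Each cell is already a list of words, so no tokenizer is needed
    result[out_col] = [[cached(w) for w in tokens] for tokens in df[col]]
    return result

# Usage with NLTK (requires the WordNet data, see nltk.download('wordnet')):
# import nltk
# lemmatizer = nltk.stem.WordNetLemmatizer()
# df = pd.DataFrame({'col': [['these', 'books'], ['cats', 'dogs']]})
# df = lemmatize_list_column(df, 'col', lemmatizer.lemmatize)
```

Passing the lemmatizer in as a callable keeps the helper independent of NLTK, so a different lemmatizer can be swapped in without changing the function.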