python pandas get rid of plural “s” in words to prepare for word count

耗尽温柔 submitted on 2019-12-25 09:08:17

Question


I have the following python pandas dataframe:

Question_ID | Customer_ID | Answer
    1           234         The team worked very hard ...
    2           234         All the teams have been working together ...

I am going to use my own code to count words in the Answer column. Beforehand, I want to strip the "s" from the word "teams", so that in the example above I count team: 2 instead of team: 1 and teams: 1.

How can I do this for all words?


Answer 1:


You need to use a tokenizer (for breaking a sentence into words) and a lemmatizer (for standardizing word forms), both provided by the Natural Language Toolkit, nltk:

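import nltk
# The lemmatizer needs the WordNet data; fetch it once with nltk.download('wordnet')
sentence = "All the teams have been working together"
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(word) for word in nltk.wordpunct_tokenize(sentence)]
# ['All', 'the', 'team', 'have', 'been', 'working', 'together']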
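
To go from single sentences to counts over the whole Answer column, something like the following sketch may help. It is not part of the original answer: `count_lemmas` and `make_wordnet_lemmatizer` are hypothetical helper names, tokens are lowercased, and a plain regular expression stands in for `nltk.wordpunct_tokenize` so that the counting function itself does not depend on NLTK.

```python
import re
from collections import Counter

import pandas as pd


def count_lemmas(answers, lemmatize):
    """Count words in a Series of strings after normalizing each token.

    `lemmatize` is any function mapping one word to its base form.
    """
    counts = Counter()
    for text in answers.dropna():
        # Keep alphabetic tokens only, lowercased so "Team" and "team" merge.
        for token in re.findall(r"[a-z]+", text.lower()):
            counts[lemmatize(token)] += 1
    return counts


def make_wordnet_lemmatizer():
    """Return NLTK's WordNet lemmatize method (needs the 'wordnet' corpus)."""
    import nltk  # imported lazily so count_lemmas works without NLTK installed
    return nltk.WordNetLemmatizer().lemmatize


df = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": [
        "The team worked very hard ...",
        "All the teams have been working together ...",
    ],
})

# Typical use, assuming nltk.download("wordnet") has already been run:
# counts = count_lemmas(df["Answer"], make_wordnet_lemmatizer())
# counts.most_common(5)
```

Passing the lemmatizer in as an argument keeps the counting logic independent of the normalization step, so a stemmer or a hand-written rule can be swapped in without touching the loop.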



Answer 2:


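Use str.replace to remove the trailing 's' from any word of three or more letters that ends in 's'. Pass regex=True explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default.

df.Answer.str.replace(r'(\w{2,})s\b', r'\1', regex=True)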

0                  The team worked very hard ...
1    All the team have been working together ...
Name: Answer, dtype: object

'{2,}' means two or more word characters. Combined with the trailing 's', that ensures the pattern skips 'is'. You can change it to '{3,}' to skip 'its' as well.
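
As a rough sketch of how the regex fits into a full word count, and of what the two quantifiers do to short words, consider the following. The helper name `strip_s`, the lowercasing step, and the `[a-z]+` tokenizer are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

answers = pd.Series([
    "The team worked very hard ...",
    "All the teams have been working together ...",
])


def strip_s(series, min_stem=2):
    """Remove a trailing 's' from words with at least `min_stem` characters before it."""
    pattern = r"(\w{%d,})s\b" % min_stem
    # regex=True is needed on pandas 2.0+, where the default is literal matching.
    return series.str.replace(pattern, r"\1", regex=True)


# Lowercase, strip the plural "s", split into words, then count one word per row.
word_counts = (
    strip_s(answers.str.lower())
    .str.findall(r"[a-z]+")
    .explode()
    .value_counts()
)

# Compare the two quantifiers on short words that end in "s".
tricky = pd.Series(["is its this"])
loose = strip_s(tricky, min_stem=2).iloc[0]   # "is it thi"
strict = strip_s(tricky, min_stem=3).iloc[0]  # "is its thi"
```

Note that neither setting protects "this", which still loses its final letter; a purely length-based rule cannot tell plurals from words that simply end in "s".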




Answer 3:


Try the NLTK toolkit, specifically its stemming and lemmatization tools. I have never personally used it, but here you can try it out.

Here is an example of some tricky plurals:

its it's his quizzes fishes maths mathematics

becomes

it it ' s hi quizz fish math mathemat

You can see it deals with "his" (and "mathematics") poorly, but then again the text could contain lots of abbreviated "hellos", where "his" really is the plural of "hi". This is the nature of the beast.
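
One possible way to soften this problem, sketched here under stated assumptions rather than taken from the answer, is to keep a small exception list of words that end in "s" but are not plurals. The names `NOT_PLURAL`, `singularize`, and `normalize` are hypothetical, and the list itself is only a starting point to be extended from your own data.

```python
import re

# Hypothetical exception list: words ending in "s" that should be left alone.
NOT_PLURAL = {"is", "its", "his", "this", "was", "has", "maths", "mathematics"}


def singularize(word):
    """Lowercase a word and drop a trailing "s" unless it is a known non-plural."""
    lower = word.lower()
    if lower in NOT_PLURAL or len(lower) < 3 or not lower.endswith("s"):
        return lower
    if lower.endswith("ss"):  # e.g. "class", "process"
        return lower
    return lower[:-1]


def normalize(text):
    """Apply singularize to every alphabetic token and rejoin with spaces."""
    return " ".join(singularize(w) for w in re.findall(r"[A-Za-z]+", text))
```

This stays deliberately naive: it does not handle "-es" or irregular plurals, so "quizzes" would become "quizze". For those cases a lemmatizer is the more reliable choice.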



Source: https://stackoverflow.com/questions/41227373/python-pandas-get-ride-of-plural-s-in-words-to-prepare-for-word-count
