Text mining with Python and pandas

六眼飞鱼酱① submitted on 2021-01-29 08:47:50

Question


This may be a duplicate, but I had no luck finding it...

I am doing some text mining in Python with pandas. I have a DataFrame that holds words, their Porter stems, and some other statistics. Similar words that share exactly the same Porter stem can therefore appear in this DataFrame. I would like to aggregate these similar words into a new column and then drop the rows that are duplicates with respect to the Porter stem.

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(list))

What I would love to have:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13
# banking     bank      [bank, banking]          12
# holding     hold      [hold, holding]          13
# bank        bank      [bank, banking]          12

After removing duplicates:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13

I tried the following join, but it brought me no closer to my goal:

pda.join(pdm, on="Porter", how="left")

Thank you for any help in advance.

EDIT: code above revised


Answer 1:


You can apply set instead of list when aggregating, so duplicate words within each stem group are removed automatically:

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 
                              'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 
                              'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(set))
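The answer stops at building pdm, so here is a minimal sketch of how the remaining steps from the question might be completed. It is one possible approach, not necessarily what the answerer had in mind. The join in the question most likely fails because pdm and pda both contain a column named Word, so the aggregated column is renamed to Merged first. The names merged, with_merged and deduped are introduced here for illustration only, and sorted(set(...)) is used instead of a bare set so that the word order is deterministic.

```python
import pandas as pd

pda = pd.DataFrame({
    'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
    'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
    'SomeData': ['12', '13', '12', '13', '12'],
})

# One entry per stem: the sorted, de-duplicated words that share that stem.
# Renaming the Series to 'Merged' avoids a column-name clash with 'Word'.
merged = (
    pda.groupby('Porter')['Word']
       .apply(lambda words: sorted(set(words)))
       .rename('Merged')
)

# Attach the merged word list to every original row via the stem.
with_merged = pda.join(merged, on='Porter', how='left')
with_merged = with_merged[['Word', 'Porter', 'Merged', 'SomeData']]

# Keep only the first row for each stem.
deduped = with_merged.drop_duplicates(subset='Porter').reset_index(drop=True)

print(with_merged)
print(deduped)
```

Passing subset='Porter' to drop_duplicates matters here: the Merged column holds lists, which are not hashable, so de-duplicating on all columns would raise an error.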


Source: https://stackoverflow.com/questions/53489987/text-mining-with-python-and-pandas
