Question
This may be a duplicate, but I had no luck finding it...
I am working on some text mining in Python with Pandas. I have words in a DataFrame, with each word's Porter stem next to it along with some other statistics. This means that different words sharing the same Porter stem appear in this DataFrame. I would like to aggregate these similar words into a new column and then drop the rows that are duplicates with respect to the Porter stem.
import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 'SomeData': ['12', '13', '12', '13', '12']})
pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(list))
What I would love to have:
# Word Porter Merged SomeData
# bank bank [bank, banking] 12
# hold hold [hold, holding] 13
# banking bank [bank, banking] 12
# holding hold [hold, holding] 13
# bank bank [bank, banking] 12
After removing duplicates:
# Word Porter Merged SomeData
# bank bank [bank, banking] 12
# hold hold [hold, holding] 13
I tried the following join, but it got me no closer to my goal:
pda.join(pdm, on="Porter", how="left")
Thank you for any help in advance.
EDIT: code above revised
Answer 1:
You can apply set instead of list, which removes the duplicate words within each group automatically:
import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
                              'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
                              'SomeData': ['12', '13', '12', '13', '12']})
pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(set))
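The groupby above only builds the per-stem collection. To get all the way to the table in the question, the collection still has to be joined back onto the original rows and the duplicate stems dropped. A minimal sketch of one way to do that follows. The names merged, full and result are just illustrative, and dict.fromkeys is used instead of set only to keep the words in first-seen order. The grouped column is renamed to Merged before joining because, as far as I can tell, the join in the question fails when both frames have a column called Word and no suffix is given.

```python
import pandas as pd

pda = pd.DataFrame({
    'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
    'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
    'SomeData': ['12', '13', '12', '13', '12'],
})

# For each stem, collect the distinct words in first-seen order.
# dict.fromkeys drops repeats while preserving order (a set would not).
merged = (
    pda.groupby('Porter')['Word']
       .apply(lambda words: list(dict.fromkeys(words)))
       .rename('Merged')  # avoid a name clash with the existing 'Word' column
)

# Attach the merged list to every original row through its stem.
full = pda.join(merged, on='Porter')

# Keep the first row per stem. subset is needed because the 'Merged'
# column holds lists, which are unhashable and cannot be compared
# by drop_duplicates.
result = full.drop_duplicates(subset='Porter').reset_index(drop=True)

print(result)
```

Here full corresponds to the first table in the question (one row per original word) and result to the table after removing duplicates. If the order of the words inside Merged does not matter, apply(set) from the answer works in place of the lambda.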
Source: https://stackoverflow.com/questions/53489987/text-mining-with-python-and-pandas