Text mining with Python and pandas

Question

This may be a duplicate, but I had no luck finding it.

I am doing some text mining in Python with pandas. I have words in a DataFrame, with the Porter stem of each word next to it along with some other statistics. This means that different words sharing the same Porter stem can appear in this DataFrame. I would like to collect these similar words into a new column, then drop the rows that are duplicates with respect to the Porter stem.

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(list))

What I would love to have:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13
# banking     bank      [bank, banking]          12
# holding     hold      [hold, holding]          13
# bank        bank      [bank, banking]          12

After removing duplicates:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13

I tried the following join, but it got me no closer to my goal.

pda.join(pdm, on="Porter", how="left")

Thank you for any help in advance.

EDIT: code above revised

Answer 1:

You can apply set instead of list, which removes the duplicate words within each group automatically:

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 
                              'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 
                              'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(set))
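To get from the grouped words to the two tables shown in the question, one possible approach is sketched below. It is not part of the original answer. The names merged, full and deduped are made up for illustration, and the sketch assumes that keeping the first row per stem is acceptable. The grouped Series is renamed to Merged so that it does not clash with the existing Word column when joined back, and a list is built from de-duplicated rows so the words keep their order of first appearance.

```python
import pandas as pd

pda = pd.DataFrame({
    'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
    'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
    'SomeData': ['12', '13', '12', '13', '12'],
})

# Unique words per stem, in order of first appearance.
# 'Merged' is a hypothetical column name taken from the desired output.
merged = (
    pda.drop_duplicates(subset=['Porter', 'Word'])
       .groupby('Porter')['Word']
       .apply(list)
       .rename('Merged')
)

# Attach the merged list to every row by matching on the stem.
full = pda.join(merged, on='Porter')

# Keep one row per stem. The subset is needed because the list
# values in 'Merged' are unhashable and cannot be compared directly.
deduped = full.drop_duplicates(subset='Porter').reset_index(drop=True)

print(full)
print(deduped)
```

Here full corresponds to the first desired table and deduped to the second, with Merged as the last column rather than the third.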

Source: https://stackoverflow.com/questions/53489987/text-mining-with-python-and-pandas

Tags

python

pandas

text-mining