pandas: Group by splitting string value in all rows (a column) and aggregation function

Question

Suppose I have a dataset like this:

id   person_name                       salary
0    [alexander, william, smith]       45000
1    [smith, robert, gates]            65000
2    [bob, alexander]                  56000
3    [robert, william]                 80000
4    [alexander, gates]                70000

Summing the salary column gives 316000.

I want to know the total salary associated with each distinct name ('alexander', 'smith', etc.). That is, for every name, sum the salaries of all rows whose person_name list contains that name.

Expected output:

group               sum_salary
alexander           171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william             125000 #sum from id 0 + 3
smith               110000 #sum from id 0 + 1
robert              145000 #sum from id 1 + 3
gates               135000 #sum from id 1 + 4
bob                  56000 #sum from id 2

As we can see, the total of the sum_salary column is not the same as the total in the initial dataset, because each salary is counted once for every name in its row (double counting).
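To make the double counting concrete, here is a small sketch that rebuilds the sample data as a DataFrame (assuming person_name holds Python lists, which the question does not state explicitly) and compares the plain total with the total in which each salary is counted once per name:

```python
import pandas as pd

# Sample data copied from the question; person_name is assumed to hold lists.
df = pd.DataFrame({
    "person_name": [
        ["alexander", "william", "smith"],
        ["smith", "robert", "gates"],
        ["bob", "alexander"],
        ["robert", "william"],
        ["alexander", "gates"],
    ],
    "salary": [45000, 65000, 56000, 80000, 70000],
})

# Plain total of the salary column.
plain_total = int(df["salary"].sum())

# Each salary weighted by the number of names in its row.
double_counted_total = int((df["salary"] * df["person_name"].str.len()).sum())

print(plain_total)           # 316000
print(double_counted_total)  # 742000
```

The second number equals the sum of the sum_salary column in the expected output (171000 + 125000 + 110000 + 145000 + 135000 + 56000).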

This seems similar to counting strings, but what confuses me is how to apply the aggregation function. I tried creating a new list of the distinct values in the person_name column, but then I got stuck.

Any help is appreciated. Thank you very much.

Answer 1:

Solutions for when the person_name column contains lists:

# if necessary, convert strings like '[a, b]' to lists first
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')

print (type(df.loc[0, 'person_name']))
<class 'list'>
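For anyone who wants to run the snippets, here is a minimal sketch of that conversion on a toy frame. The name raw and the two sample rows are only for illustration; it assumes the strings look exactly like '[a, b, c]' with ', ' between names:

```python
import pandas as pd

# Hypothetical input where person_name holds strings, not lists.
raw = pd.DataFrame({
    "person_name": ["[alexander, william, smith]", "[bob, alexander]"],
    "salary": [45000, 56000],
})

# Strip the surrounding brackets, then split on ', ' to get real lists.
raw["person_name"] = raw["person_name"].str.strip("[]").str.split(", ")

print(type(raw.loc[0, "person_name"]))
print(raw["person_name"].tolist())
```

If the separator is inconsistent (for example ',' without a space), the split pattern would need to be adjusted.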

The first idea is to use a defaultdict to accumulate the summed values in a loop:

from collections import defaultdict

d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)

print (d)
defaultdict(<class 'int'>, {'alexander': 171000, 
                            'william': 125000, 
                            'smith': 110000, 
                            'robert': 145000, 
                            'gates': 135000, 
                            'bob': 56000})

Then build a DataFrame from the dictionary:

df1 = pd.DataFrame({'group':list(d.keys()),
                    'sum_salary':list(d.values())})
print (df1)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000

Another solution flattens the lists, repeats each salary by the length of its list, and aggregates with sum:

from itertools import chain

df1 = pd.DataFrame({
    'group' : list(chain.from_iterable(df['person_name'].tolist())), 
    'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})

df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
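Not part of the original answer: on pandas 0.25 or newer, DataFrame.explode does the same flatten-and-repeat step in one call. A sketch, assuming person_name holds lists as above:

```python
import pandas as pd

# Sample data copied from the question; person_name is assumed to hold lists.
df = pd.DataFrame({
    "person_name": [
        ["alexander", "william", "smith"],
        ["smith", "robert", "gates"],
        ["bob", "alexander"],
        ["robert", "william"],
        ["alexander", "gates"],
    ],
    "salary": [45000, 65000, 56000, 80000, 70000],
})

result = (
    df.explode("person_name")                 # one row per (name, salary) pair
      .groupby("person_name", sort=False)["salary"]
      .sum()                                  # total salary per name
      .rename_axis("group")                   # rename the index to 'group'
      .reset_index(name="sum_salary")         # back to a two-column DataFrame
)
print(result)
```

With sort=False the groups appear in order of first appearance, which matches the layout of the expected output in the question.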

Answer 2:

Another solution:

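import numpy as np
import pandas as pd

# Flatten the lists of names and repeat each salary once per name in its row
df_new = pd.DataFrame({'person_name': np.concatenate(df.person_name.values),
                       'salary': df.salary.repeat(df.person_name.str.len())})
print(df_new.groupby('person_name')['salary'].sum().reset_index())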


  person_name  salary
0   alexander  171000
1         bob   56000
2       gates  135000
3      robert  145000
4       smith  110000
5     william  125000

Answer 3:

This can be done concisely with dummies, though performance will suffer due to all of the .str methods:

df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()

#alexander    171000
#bob           56000
#gates        135000
#robert       145000
#smith        110000
#william      125000
#dtype: int64
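To see what the one-liner does, here is a sketch that splits it into steps. The names dummies and totals are only for illustration, and the data is the sample from the question with person_name assumed to hold lists:

```python
import pandas as pd

# Sample data copied from the question; person_name is assumed to hold lists.
df = pd.DataFrame({
    "person_name": [
        ["alexander", "william", "smith"],
        ["smith", "robert", "gates"],
        ["bob", "alexander"],
        ["robert", "william"],
        ["alexander", "gates"],
    ],
    "salary": [45000, 65000, 56000, 80000, 70000],
})

# Join each list into one string, then build a 0/1 indicator column per name.
dummies = df["person_name"].str.join("*").str.get_dummies("*")
print(dummies)

# Multiply each row by its salary (axis=0 aligns on the row index),
# then sum down the columns to get the total per name.
totals = dummies.multiply(df["salary"], axis=0).sum()
print(totals)
```

The indicator frame has one column per distinct name, so this approach can get wide when there are many distinct names.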

Answer 4:

I parsed the data as strings that look like lists, by copying the OP's data and using pandas.read_clipboard(). If the column really is a series of such strings, this solution works:

df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')

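# Clean up the brackets and spaces, then do a simple groupby.
# regex=False treats '[' and ']' as literal characters, not regex syntax.
df['value'] = df['value'].str.replace('[', '', regex=False)
df['value'] = df['value'].str.replace(']', '', regex=False)
df['value'] = df['value'].str.replace(' ', '', regex=False)
df.groupby('value')['salary'].sum()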

Output:

value
alexander    171000
bob           56000
gates        135000
robert       145000
smith        110000
william      125000

Answer 5:

Another way to do this is with iterrows(). This will not be as fast as jezrael's solution, but it works:

ids = []
names = []
salarys = []

# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
    for name in row['person_name']:
        ids.append(row['id'])
        names.append(name)
        salarys.append(row['salary'])

# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
                       'names':names,
                       'salary':salarys})

# Group by names and sum the salaries
print(df_new.groupby('names').salary.sum().reset_index())

Output:

       names  salary
0  alexander  171000
1        bob   56000
2      gates  135000
3     robert  145000
4      smith  110000
5    william  125000

Source: https://stackoverflow.com/questions/55124329/pandas-group-by-splitting-string-value-in-all-rows-a-column-and-aggregation-f

Tags

python

pandas

numpy