Group dataframe by multiple columns and append the result to the dataframe

…衆ロ難τιáo~ submitted on 2019-12-06 15:39:51
Roman Pekar

Merge grouped result with the initial DataFrame:

>>> df1 = df.groupby(['id','country'])['source'].apply(
...     lambda x: x.tolist()).reset_index()

>>> df1
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        3       [1.0]
3  2        1       [1.0]

>>> df2 = df[['id', 'country']]
>>> df2
  id  country
1  1        1
2  1        2
3  1        2
4  1        3
5  2        1

>>> pd.merge(df1, df2, on=['id', 'country'])
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        2  [1.0, 2.0]
3  1        3       [1.0]
4  2        1       [1.0]
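
To append the lists to the full dataframe rather than to the key columns only, the same idea can be sketched end to end as below. The input frame is an assumption reconstructed from the outputs shown above, and the `source_list` column name is chosen here for illustration; neither comes from the original question.

```python
import pandas as pd

# Hypothetical input, reconstructed from the outputs shown above.
df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 2],
    'country': [1, 2, 2, 3, 1],
    'source':  [1.0, 1.0, 2.0, 1.0, 1.0],
})

# One row per (id, country) pair, holding that pair's source values as a list.
source_lists = (
    df.groupby(['id', 'country'])['source']
      .apply(list)
      .reset_index(name='source_list')
)

# Left-merge back so every original row keeps its columns and gains the list.
result = df.merge(source_lists, on=['id', 'country'], how='left')
print(result)
```

Using `how='left'` keeps one output row per original row, in the original order, so `result` has the same length as `df`.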

This can be achieved without the merge by reassigning the result of the groupby.apply to the original dataframe.

df = df.groupby(['id', 'country'], group_keys=False).apply(_add_sourcelist_col)

with your _add_sourcelist_col function being,

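def _add_sourcelist_col(group):
    group = group.copy()
    # group is a DataFrame: collect the unique 'source' values, one full list per row
    unique_sources = list(set(group['source']))
    group['source_list'] = [unique_sources] * len(group)
    return group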

Note that your function can add further columns as well: assign them to each group dataframe, and make sure to return the group at the end of the function.

Edit: I'll leave the info above as it might still be useful, but I misinterpreted part of the original question. What the OP was trying to accomplish can be done using,

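def addsource(x):
    x = x.copy()  # work on a copy rather than the slice groupby passes in
    # Repeat the group's unique sources once per row so each row gets the full list
    x['source_list'] = [list(set(x['source']))] * len(x)
    return x

df = df.groupby(['id', 'country'], group_keys=False).apply(addsource)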

An alternative method that avoids the post-facto merge is providing the index in the function applied to each group, e.g.

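def calculate_on_group(x):
    # x is the 'source' Series for one (id, country) group
    fill_val = x.unique().tolist()
    # Reuse the group's index so the result aligns with the original rows
    return pd.Series([fill_val] * x.size, index=x.index)

# group_keys=False keeps the original row index on the result in recent pandas versions
df['source_list'] = df.groupby(['id','country'], group_keys=False)['source'].apply(calculate_on_group)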
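
To try the index-preserving approach end to end, here is a minimal sketch. The sample dataframe is an assumption (the original one is not shown in full), and `group_keys=False` is passed on the assumption that a recent pandas version would otherwise prepend the group keys to the result index and break the alignment.

```python
import pandas as pd

# Hypothetical input; the original dataframe is not shown in full.
df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 2],
    'country': [1, 2, 2, 3, 1],
    'source':  [1.0, 1.0, 2.0, 1.0, 1.0],
})

def calculate_on_group(x):
    # Unique values of this group's 'source' Series, in order of appearance.
    fill_val = x.unique().tolist()
    # One copy of the list per row, indexed like the group itself.
    return pd.Series([fill_val] * x.size, index=x.index)

# The result carries the original row labels, so assignment aligns row by row.
source_list = (
    df.groupby(['id', 'country'], group_keys=False)['source']
      .apply(calculate_on_group)
)
df['source_list'] = source_list
print(df)
```

Because the returned Series reuses `x.index`, no merge is needed: pandas aligns the new column to `df` by index label.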