Enumerate each row for each group in a DataFrame


Question:

In pandas, how can I add a new column which enumerates rows based on a given grouping?

For instance, assume the following DataFrame:

import pandas as pd
import numpy as np

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df

  col_a  col_b
0     A      0
1     B      1
2     C      2
3     A      3
4     A      4
5     C      5
6     B      6
7     B      7
8     A      8
9     C      9

I'd like to add a col_c that gives me the Nth row of the "group" based on a grouping of col_a and sorting of col_b.

Desired output:

  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3

I'm struggling to get to col_c. I can get the proper grouping and sorting with .sort_index(by=['col_a', 'col_b']); it's now a matter of creating that new column and labelling each row.

Answer 1:

There's cumcount, for precisely this case:

g = df.groupby('col_a')
df['col_c'] = g.cumcount()

As it says in the docs:

Number each item in each group from 0 to the length of that group - 1.
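For the question's example, a minimal sketch of this approach (sorting first so the numbering follows col_b within each group, and adding 1 because cumcount starts at 0 while the desired output starts at 1; sort_values is the current pandas sorting API, used here instead of the question's sort_index(by=...)):

import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Order rows within each group by col_b, then number them 1, 2, ...
df = df.sort_values(['col_a', 'col_b'])
df['col_c'] = df.groupby('col_a').cumcount() + 1  # +1 for 1-based numbering
print(df)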


Original answer (written before cumcount was added to pandas):

You could create a helper function to do this:

def add_col_c(x):
    x['col_c'] = np.arange(len(x))
    return x

First sort by column col_a:

In [11]: df.sort('col_a', inplace=True)

then apply this function across each group:

In [12]: g = df.groupby('col_a', as_index=False)

In [13]: g.apply(add_col_c)
Out[13]:
  col_a  col_b  col_c
3     A      3      0
8     A      8      1
0     A      0      2
4     A      4      3
6     B      6      0
1     B      1      1
7     B      7      2
9     C      9      0
2     C      2      1
5     C      5      2

In order to get 1, 2, ... you could use np.arange(1, len(x) + 1).
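As a sketch, the helper with 1-based numbering would then look like this:

def add_col_c(x):
    # Number rows within the group starting at 1 instead of 0
    x['col_c'] = np.arange(1, len(x) + 1)
    return x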



Answer 2:

The given answers both involve calling a Python function for each group, and if you have many groups a vectorized approach should be faster (I haven't checked).

Here is my pure numpy suggestion:

In [5]: df.sort(['col_a', 'col_b'], inplace=True, ascending=(False, False))

In [6]: sizes = df.groupby('col_a', sort=False).size().values

In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)

In [8]: print df
  col_a  col_b  col_c
9     C      9      0
5     C      5      1
2     C      2      2
7     B      7      0
6     B      6      1
1     B      1      2
8     A      8      0
4     A      4      1
3     A      3      2
0     A      0      3
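Note that df.sort and the print statement above date from older pandas and Python 2; as a sketch, the same vectorized idea with current pandas would look roughly like this (assuming df is the example frame from the question):

import numpy as np

df = df.sort_values(['col_a', 'col_b'], ascending=[False, False])
sizes = df.groupby('col_a', sort=False).size().values
# Global row number minus the offset at which each group starts
df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
print(df)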


Answer 3:

You could define your own function to deal with that:

In [58]: def func(x):
   ....:     x['col_c'] = x['col_a'].argsort() + 1
   ....:     return x
   ....:

In [59]: df.groupby('col_a').apply(func)
Out[59]:
  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3

