可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
In pandas, how can I add a new column which enumerates rows based on a given grouping?
For instance, assume the following DataFrame:
import pandas as pd import numpy as np a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C'] df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)}) df col_a col_b 0 A 0 1 B 1 2 C 2 3 A 3 4 A 4 5 C 5 6 B 6 7 B 7 8 A 8 9 C 9
I'd like to add a col_c
that gives me the Nth row of the "group" based on a grouping of col_a
and sorting of col_b
.
Desired output:
col_a col_b col_c 0 A 0 1 3 A 3 2 4 A 4 3 8 A 8 4 1 B 1 1 6 B 6 2 7 B 7 3 2 C 2 1 5 C 5 2 9 C 9 3
I'm struggling to get to col_c
. You can get to the proper grouping and sorting with .sort_index(by=['col_a', 'col_b'])
, it's now a matter of getting to that new column and labeling each row.
回答1:
There's cumcount, for precisely this case:
df['col_c'] = g.cumcount()
As it says in the docs:
Number each item in each group from 0 to the length of that group - 1.
Original answer (before cumcount was defined).
You could create a helper function to do this:
def add_col_c(x): x['col_c'] = np.arange(len(x)) return x
First sort by column col_a:
In [11]: df.sort('col_a', inplace=True)
then apply this function across each group:
In [12]: g = df.groupby('col_a', as_index=False) In [13]: g.apply(add_col_c) Out[13]: col_a col_b col_c 3 A 3 0 8 A 8 1 0 A 0 2 4 A 4 3 6 B 6 0 1 B 1 1 7 B 7 2 9 C 9 0 2 C 2 1 5 C 5 2
In order to get 1,2,...
you couls use np.arange(1, len(x) + 1)
.
回答2:
The given answers both involve calling a python function for each group, and if you have many groups a vectorized approach should be faster (I havent checked).
Here is my pure numpy suggestion:
In [5]: df.sort(['col_a', 'col_b'], inplace=True, ascending=(False, False)) In [6]: sizes = df.groupby('col_a', sort=False).size().values In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes) In [8]: print df col_a col_b col_c 9 C 9 0 5 C 5 1 2 C 2 2 7 B 7 0 6 B 6 1 1 B 1 2 8 A 8 0 4 A 4 1 3 A 3 2 0 A 0 3
回答3:
You could define your own function to deal with that:
In [58]: def func(x): ....: x['col_c'] = x['col_a'].argsort() + 1 ....: return x ....: In [59]: df.groupby('col_a').apply(func) Out[59]: col_a col_b col_c 0 A 0 1 3 A 3 2 4 A 4 3 8 A 8 4 1 B 1 1 6 B 6 2 7 B 7 3 2 C 2 1 5 C 5 2 9 C 9 3