Sample each group after pandas groupby

Submitted by 早过忘川 on 2019-12-17 18:34:08

Question


I know this must have been answered somewhere, but I just could not find it.

Problem: sample each group after a groupby operation.

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})

grouped = df.groupby('b')

# now sample from each group, e.g., I want 30% of each group

Answer 1:


Apply a lambda that calls sample with the frac parameter:

In [2]:
df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})

grouped = df.groupby('b')
grouped.apply(lambda x: x.sample(frac=0.3))

Out[2]:
     a  b
b        
0 6  7  0
1 2  3  1
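Because sample draws at random, the rows shown above will differ from run to run. As a minimal sketch, assuming reproducible output is wanted, the same lambda can pass random_state to sample. The seed value 0 and the variable names are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# random_state is an addition for reproducibility; the seed 0 is arbitrary.
sampled = df.groupby('b').apply(lambda x: x.sample(frac=0.3, random_state=0))

# The groups hold 4 and 3 rows; 30% of each rounds to one row per group.
rows_per_group = sampled.groupby(level='b').size()
print(rows_per_group)
```

With frac, the number of rows taken from a group is the rounded product of frac and the group size, so very small groups can end up contributing no rows at all.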



Answer 2:


Sample a fraction of each group

You can use GroupBy.apply with sample. You do not need to use a lambda; apply accepts keyword arguments:

frac = .3
df.groupby('b').apply(pd.DataFrame.sample, frac=frac)
     a  b
b        
0 6  7  0
1 0  1  1

If the MultiIndex is not required, you may specify group_keys=False to groupby:

df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, frac=.3)

   a  b
6  7  0
2  3  1
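As an alternative not covered in the answer, and assuming pandas 1.1 or later is installed, the GroupBy object has its own sample method, which avoids apply and keeps the original flat index. The following is a sketch under that version assumption; the seed value is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Assumes pandas >= 1.1, where GroupBy.sample is available.
# frac is applied to each group separately; random_state makes the draw repeatable.
sampled = df.groupby('b').sample(frac=0.3, random_state=0)
print(sampled)
```

The result has the same columns as df and no group level in the index, much like the group_keys=False variant above.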

Sample N rows from each group

apply is slow. If your use case is to sample a fixed number of rows, you can shuffle the DataFrame beforehand, then use GroupBy.head.

df.sample(frac=1).groupby('b').head(2)

   a  b
2  3  1
5  6  0
1  2  1
4  5  0
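The shuffle-then-head idea can be sketched as a small script that also checks how many rows each group contributes. N and random_state are illustrative additions here; the original one-liner uses a literal 2 and no seed.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

N = 2  # rows wanted per group (illustrative name)

# Step 1: shuffle all rows; frac=1 returns every row in random order.
shuffled = df.sample(frac=1, random_state=0)

# Step 2: keep the first N rows of each group in the shuffled order.
sampled = shuffled.groupby('b').head(N)

# Each group contributes at most N rows.
counts = sampled['b'].value_counts()
print(sampled)
print(counts)
```

One difference worth knowing: head returns every row of a group that has fewer than N rows, whereas sample(n=N) without replacement raises an error in that case.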

This gives the same kind of result as df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=N), with N = 2 here, but it is faster:

%timeit df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=2)
# 3.19 ms ± 90.5 µs
%timeit df.sample(frac=1).groupby('b').head(2)
# 1.56 ms ± 103 µs
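The timings above rely on IPython magics. Outside IPython, a rough comparison can be sketched with the standard timeit module. The function names and the repetition count are assumptions made for this example, and absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})


def sample_with_apply(frame, n=2):
    # Sample n rows per group through GroupBy.apply.
    return frame.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=n)


def sample_with_shuffle_head(frame, n=2):
    # Shuffle once, then take the first n rows of each group.
    return frame.sample(frac=1).groupby('b').head(n)


# Total seconds for 50 calls of each approach (50 is an arbitrary choice).
apply_seconds = timeit.timeit(lambda: sample_with_apply(df), number=50)
head_seconds = timeit.timeit(lambda: sample_with_shuffle_head(df), number=50)

print(f"apply:        {apply_seconds:.4f} s for 50 calls")
print(f"shuffle+head: {head_seconds:.4f} s for 50 calls")
```

On a frame this small the gap is modest; rerunning the comparison on data of realistic size is the more useful test.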


Source: https://stackoverflow.com/questions/36390406/sample-each-group-after-pandas-groupby
