Pandas: Selecting rows for which groupby.sum() satisfies condition

Submitted by 大城市里の小女人 on 2020-01-21 11:30:13

Question


In pandas I have a dataframe of the form:

>>> import pandas as pd  
>>> df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,1,0]})
>>> df

ID   x
51   0
51   1
51   0
24   0
24   1
24   1
31   0

For every 'ID', the value of 'x' is recorded several times; each value is either 0 or 1. I want to select the rows of df whose 'ID' has 'x' equal to 1 at least twice.

For every 'ID', I can count the number of times 'x' is 1 with

>>> df.groupby('ID')['x'].sum()

ID
24    2
31    0
51    1

But I don't know how to proceed from here. I would like the following output:

ID   x
24   0
24   1
24   1

Answer 1:


Use groupby and filter

df.groupby('ID').filter(lambda s: s.x.sum()>=2)

Output:

   ID  x
3  24  0
4  24  1
5  24  1
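
To see why this works: filter calls the lambda once per group, passing that group's sub-DataFrame, and keeps every row of the groups for which it returns True. A minimal sketch on the question's data (the names keep and filtered are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# Evaluate the same condition by hand, once per group, to see which IDs pass.
keep = {int(group_id): bool(group.x.sum() >= 2)
        for group_id, group in df.groupby('ID')}
# keep == {24: True, 31: False, 51: False}

# filter applies that per-group decision to every row of the group.
filtered = df.groupby('ID').filter(lambda s: s.x.sum() >= 2)
```

Only ID 24 passes, so filtered contains the rows with index 3, 4 and 5.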



Answer 2:


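df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,1,0]})
# Broadcast each group's sum back onto its rows, then keep rows whose group sum is >= 2.
# The string 'sum' is used because newer pandas versions warn when the builtin sum is passed.
df.loc[df.groupby('ID')['x'].transform('sum') >= 2]
Output: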
   ID  x
3  24  0
4  24  1
5  24  1
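
The intermediate steps may make this clearer. As a small sketch on the same data (group_sums and mask are illustrative names), transform returns one value per original row, so the comparison gives a boolean mask aligned with df:

```python
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# One value per row: the sum of 'x' over that row's ID group.
group_sums = df.groupby('ID')['x'].transform('sum')
# group_sums.tolist() == [1, 1, 1, 2, 2, 2, 0]

# Boolean mask with the same index as df, usable directly in .loc.
mask = group_sums >= 2
result = df.loc[mask]
```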



Answer 3:


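Using np.bincount and pd.factorize
An alternative, more advanced technique that gives better performance:

import numpy as np

# f holds an integer label per row, u the unique IDs in order of first appearance
f, u = df.ID.factorize()
# Sum 'x' per label, map the sums back to the rows, and keep rows with a sum >= 2
df[np.bincount(f, df.x.values)[f] >= 2]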

   ID  x
3  24  0
4  24  1
5  24  1
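
Here is a sketch of what each intermediate array holds for the sample data (sums and per_row are illustrative names; weights is the second parameter of np.bincount):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# Integer label per row, assigned in order of first appearance.
f, u = df.ID.factorize()
# f.tolist() == [0, 0, 0, 1, 1, 1, 2]; list(u) == [51, 24, 31]

# With weights, bincount adds up the weights per label instead of counting rows.
sums = np.bincount(f, weights=df.x.values)
# sums.tolist() == [1.0, 2.0, 0.0]

# Indexing by f maps each group's sum back to its rows.
per_row = sums[f]
result = df[per_row >= 2]
```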

In obnoxious one-liner form

df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]

   ID  x
3  24  0
4  24  1
5  24  1

np.bincount and np.unique
I could have used np.unique with the return_inverse parameter to accomplish exactly the same thing. However, np.unique sorts the array, which changes the time complexity of the solution: the sort costs O(n log n), whereas the hash-based factorize is roughly linear.

u, f = np.unique(df.ID.values, return_inverse=True)
df[np.bincount(f, df.x.values)[f] >= 2]

One-liner

df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
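
The two encodings differ only in how they number the groups, which this sketch on the sample data tries to show (variable names are illustrative). The labels come out in sorted order for np.unique and in order of first appearance for factorize, but the resulting row mask is the same:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# np.unique: labels follow the sorted unique values.
u_sorted, f_sorted = np.unique(df.ID.values, return_inverse=True)
# u_sorted.tolist() == [24, 31, 51]; f_sorted.tolist() == [2, 2, 2, 0, 0, 0, 1]

# factorize: labels follow the order in which IDs first appear.
f_seen, u_seen = df.ID.factorize()
# list(u_seen) == [51, 24, 31]; f_seen.tolist() == [0, 0, 0, 1, 1, 1, 2]

mask_unique = np.bincount(f_sorted, df.x.values)[f_sorted] >= 2
mask_factorize = np.bincount(f_seen, df.x.values)[f_seen] >= 2
```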

Timing (the results in each block below are listed in the same order as these statements)

%timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]
%timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
%timeit df.groupby('ID').filter(lambda s: s.x.sum()>=2)
%timeit df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2]
%timeit df.loc[df.groupby(['ID'])['x'].transform('sum')>=2]

small data

1000 loops, best of 3: 302 µs per loop
1000 loops, best of 3: 241 µs per loop
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.2 ms per loop
1000 loops, best of 3: 1.21 ms per loop

large data

np.random.seed([3,1415])
df = pd.DataFrame(dict(
        ID=np.random.randint(100, size=10000),
        x=np.random.randint(2, size=10000)
    ))

1000 loops, best of 3: 528 µs per loop
1000 loops, best of 3: 847 µs per loop
10 loops, best of 3: 20.9 ms per loop
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 1.55 ms per loop

larger data

np.random.seed([3,1415])
df = pd.DataFrame(dict(
        ID=np.random.randint(100, size=100000),
        x=np.random.randint(2, size=100000)
    ))

1000 loops, best of 3: 2.01 ms per loop
100 loops, best of 3: 6.44 ms per loop
10 loops, best of 3: 29.4 ms per loop
100 loops, best of 3: 3.84 ms per loop
100 loops, best of 3: 3.74 ms per loop
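
These numbers depend on the machine and on the pandas and NumPy versions, so it may be worth re-running the comparison. A rough sketch of how that could be done outside IPython with the standard timeit module follows; the function names and the repeat counts are my own choices, not part of the measurements above:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    ID=np.random.randint(100, size=10000),
    x=np.random.randint(2, size=10000),
))

def via_factorize():
    f = df.ID.factorize()[0]
    return df[np.bincount(f, df.x.values)[f] >= 2]

def via_unique():
    f = np.unique(df.ID.values, return_inverse=True)[1]
    return df[np.bincount(f, df.x.values)[f] >= 2]

def via_filter():
    return df.groupby('ID').filter(lambda s: s.x.sum() >= 2)

def via_transform():
    return df.loc[df.groupby('ID')['x'].transform('sum') >= 2]

for fn in (via_factorize, via_unique, via_filter, via_transform):
    # Best of 3 runs of 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.3f} ms per loop')
```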


Source: https://stackoverflow.com/questions/44531696/pandas-selecting-rows-for-which-groupby-sum-satisfies-condition
