Select the max row per group - pandas performance issue

闹比i 2020-12-05 16:44

I'm selecting the row with the maximum value in each group. I use groupby/agg to get the index values, then select the rows using loc.

For exam
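The question's own example is cut off above. As a rough sketch of the approach it describes (groupby returning index labels, then `loc`), assuming a frame with an `Id` grouping column and a `delta` value column as in the answer below; the function name `using_idxmax` and the sample data are made up for illustration:

```python
import pandas as pd

def using_idxmax(df):
    # For each Id, idxmax gives the index label of the first row
    # holding the largest delta; loc then selects those rows.
    idx = df.groupby("Id")["delta"].idxmax()
    return df.loc[idx]

# Small illustrative frame (made-up data)
example = pd.DataFrame({"Id": [1, 1, 2, 2, 2], "delta": [3, 7, 5, 2, 5]})
result = using_idxmax(example)
```

This is likely the slow path the question is about: it is simple to write, but the per-group `idxmax` can become expensive when there are many small groups.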

2 Answers
  •  星月不相逢
    2020-12-05 17:16

    Using Numba's jit

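    from numba import njit
    import numpy as np
    import pandas as pd

    @njit
    def nidxmax(bins, k, weights):
        # out[b]: row position of the largest weight seen so far in group b
        out = np.zeros(k, np.int64)
        # trk[b]: that largest weight, after shifting
        trk = np.zeros(k)
        # Shift weights so the minimum is 1; every value then beats the 0 in trk
        for i, w in enumerate(weights - (weights.min() - 1)):
            b = bins[i]
            if w > trk[b]:  # strict >, so the first maximal row wins ties
                trk[b] = w
                out[b] = i
        return np.sort(out)

    def with_numba_idxmax(df):
        # f: integer group code per row; u: the unique Id values
        f, u = pd.factorize(df.Id)
        return df.iloc[nidxmax(f, len(u), df.delta.values)]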
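To see what the loop does without needing Numba installed, here is a plain-Python sketch of the same bookkeeping, cross-checked against groupby/idxmax. The names `py_idxmax` and `with_python_idxmax` are hypothetical, and the check assumes a default RangeIndex so that row positions and index labels coincide:

```python
import numpy as np
import pandas as pd

def py_idxmax(bins, k, weights):
    # Same logic as nidxmax, minus the @njit decorator.
    out = np.zeros(k, np.int64)
    trk = np.zeros(k)
    shifted = weights - (weights.min() - 1)  # smallest value becomes 1
    for i, w in enumerate(shifted):
        b = bins[i]
        if w > trk[b]:
            trk[b] = w
            out[b] = i
    return np.sort(out)

def with_python_idxmax(df):
    f, u = pd.factorize(df["Id"])
    return df.iloc[py_idxmax(f, len(u), df["delta"].values)]

# Cross-check on random data against the groupby/idxmax result
rng = np.random.default_rng(0)
check_df = pd.DataFrame(rng.integers(0, 50, size=(1000, 2)), columns=["Id", "delta"])
expected = check_df.loc[check_df.groupby("Id")["delta"].idxmax()].sort_index()
same = with_python_idxmax(check_df).equals(expected)
```

The pure-Python loop is slow on large frames; it is only meant to show that a single pass over the rows, tracking the best value per group, picks the same rows.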
    

    Borrowing from @unutbu

    def make_df(N):
        # lots of small groups
        df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
        # few large groups
        # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
        return df
    

    Prime the JIT (the first call triggers compilation, so run it once before timing)

    with_numba_idxmax(make_df(10));
    

    Test

    df = make_df(2**20)
    
    
    %timeit with_numba_idxmax(df)
    %timeit using_sort_drop(df)
    
    47.4 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    194 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
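`using_sort_drop` is timed above but not defined in this answer; it presumably comes from @unutbu's answer, which is not reproduced here. The following is only a guess at its shape, assuming it sorts by `delta` in descending order and keeps the first row per `Id`:

```python
import pandas as pd

def using_sort_drop(df):
    # Assumed implementation: put the largest delta first,
    # then keep the first row encountered for each Id.
    ordered = df.sort_values(by="delta", ascending=False, kind="mergesort")
    return ordered.drop_duplicates("Id")

# Small illustrative frame (made-up data)
example = pd.DataFrame({"Id": [1, 1, 2, 2], "delta": [3, 7, 5, 2]})
result = using_sort_drop(example).sort_index()
```

Note that without the final `sort_index()` the rows come back ordered by `delta` rather than by their original position, unlike `with_numba_idxmax`.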
    
