Python Pandas Drop Duplicates keep second to last

Frontend · Unresolved · 2 answers · 457 views
夕颜 · 2021-02-07 23:35

What's the most efficient way to select the second-to-last of each duplicated set in a pandas DataFrame?

For instance, I basically want this operation: for each set of duplicated rows, keep the second-to-last row (or the only row when a set has just one).

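A minimal sketch of the operation in question, using a hypothetical DataFrame (the column names 'A'/'B'/'C' and the values are assumptions, not from the original post):

```python
import pandas as pd

# Hypothetical data: (A, B) identifies a duplicated set
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3],
                   'B': ['x', 'x', 'x', 'y', 'y', 'z'],
                   'C': [10, 20, 30, 40, 50, 60]})

# For each (A, B) set, keep the second-to-last row,
# or the only row when the set has a single member.
result = (df.groupby(['A', 'B'], as_index=False)
            .apply(lambda g: g if len(g) == 1 else g.iloc[[-2]])
            .reset_index(level=0, drop=True))
print(result)  # rows with C = 20, 40, 60
```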
2 Answers
  •  不要未来只要你来
    2021-02-08 00:21

    You could use groupby/tail(2) to take the last 2 items of each group, then groupby/head(1) to take the first item from that tail:

    df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
    

    If there is only one item in the group, tail(2) returns just the one item.


    For example,

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
    result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
    
    expected = (df.groupby(['A', 'B'], as_index=False)
                  .apply(lambda x: x if len(x) == 1 else x.iloc[[-2]])
                  .reset_index(level=0, drop=True))
    assert expected.sort_index().equals(result)
    

    The builtin groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:

    In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
    1000 loops, best of 3: 1.7 ms per loop
    
    In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
    100 loops, best of 3: 17.9 ms per loop
    

    Alternatively, ayhan suggests a nice improvement:

    alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
    assert expected.sort_index().equals(alt)
    
    In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
    1000 loops, best of 3: 1.43 ms per loop
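    The default keep='first' of drop_duplicates is what makes this work: of the (at most) two rows that tail(2) keeps per group, the first one is the group's second-to-last row. A quick check on hypothetical data (values here are assumptions, not from the original post):

    ```python
    import pandas as pd

    df = pd.DataFrame({'A': [1, 1, 1, 2],
                       'B': ['x', 'x', 'x', 'y'],
                       'C': [10, 20, 30, 40]})

    # tail(2) keeps the last two rows per (A, B) group;
    # drop_duplicates(keep='first') then keeps the earlier of those two,
    # i.e. the second-to-last row of the original group.
    alt = df.groupby(['A', 'B']).tail(2).drop_duplicates(['A', 'B'])
    print(alt)  # rows with C = 20 and 40
    ```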
    
