Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)

我寻月下人不归 2020-11-22 14:31

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done pa

6 answers
  •  心在旅途
    2020-11-22 15:28

    I would suggest using the duplicated method on the Pandas Index itself:

    df3 = df3[~df3.index.duplicated(keep='first')]
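
    To see what the mask looks like, here is a minimal sketch. The original df3 is not reproduced in this answer, so the frame below is a made-up stand-in (the column name temp and the timestamps are assumptions) with one repeated 5-minute timestamp:

```python
import pandas as pd

# Hypothetical stand-in for df3: the 00:05 observation appears twice.
index = pd.to_datetime([
    "2001-01-01 00:00",
    "2001-01-01 00:05",
    "2001-01-01 00:05",
    "2001-01-01 00:10",
])
df3 = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0]}, index=index)

# True for every index label that has already occurred earlier in the index.
mask = df3.index.duplicated(keep="first")

# Inverting the mask keeps only the first row for each label.
deduped = df3[~mask]
print(deduped)
```

    Because the mask is a plain boolean array, the row order of the remaining data is left unchanged.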
    

    While all the other methods work, the currently accepted answer is by far the slowest for the provided example. The groupby method is closer (roughly twice as slow as duplicated in the timings below), but I also find the duplicated method more readable.

    Using the sample data provided:

    >>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
    1000 loops, best of 3: 1.54 ms per loop
    
    >>> %timeit df3.groupby(df3.index).first()
    1000 loops, best of 3: 580 µs per loop
    
    >>> %timeit df3[~df3.index.duplicated(keep='first')]
    1000 loops, best of 3: 307 µs per loop
    

    Note that you can keep the last element by changing the keep argument to 'last'.
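
    A small sketch of how the keep argument changes the result, using a made-up frame with string labels. The keep=False case goes beyond what is described above: it marks every occurrence of a repeated label, so all of them are dropped:

```python
import pandas as pd

# Hypothetical frame in which the label "b" occurs twice.
df = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "b", "c"])

# Keep the first row of each repeated label.
first = df[~df.index.duplicated(keep="first")]

# Keep the last row of each repeated label.
last = df[~df.index.duplicated(keep="last")]

# keep=False flags every occurrence, so repeated labels vanish entirely.
unique_only = df[~df.index.duplicated(keep=False)]

print(first["temp"].tolist())        # [1.0, 2.0, 4.0]
print(last["temp"].tolist())         # [1.0, 3.0, 4.0]
print(unique_only["temp"].tolist())  # [1.0, 4.0]
```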

    It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

    >>> %timeit df1.groupby(level=df1.index.names).last()
    1000 loops, best of 3: 771 µs per loop
    
    >>> %timeit df1[~df1.index.duplicated(keep='last')]
    1000 loops, best of 3: 365 µs per loop
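
    Paul's df1 is not shown here, so the following is only an illustrative sketch with a hypothetical two-level index (the level names station and obs are made up). It checks that the duplicated mask and the groupby approach pick the same rows:

```python
import pandas as pd

# Hypothetical MultiIndex in which the pair ("A", 1) occurs twice.
idx = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 1), ("A", 2), ("B", 1)],
    names=["station", "obs"],
)
df1 = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# duplicated() compares whole index tuples, not individual levels.
deduped = df1[~df1.index.duplicated(keep="last")]

# The groupby equivalent from the timing above, for comparison.
grouped = df1.groupby(level=df1.index.names).last()

print(deduped)
```

    The two results line up here because the sample index is already sorted; groupby sorts its output by the group keys, whereas the mask preserves the original row order.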
    
