Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)


我寻月下人不归 2020-11-22 14:31

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done pa

6 answers

心在旅途 (original poster)

2020-11-22 15:28
I would suggest using the duplicated method on the Pandas Index itself:
```
df3 = df3[~df3.index.duplicated(keep='first')]
```
While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
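To make the mechanics concrete, here is a minimal sketch of what the boolean mask looks like. The frame, its temp column, and the timestamps are made up for illustration and are not the df3 from the question.

```python
import pandas as pd

# Hypothetical sample: three observations, with one timestamp repeated.
idx = pd.to_datetime(["2001-01-01 00:00", "2001-01-01 00:05", "2001-01-01 00:05"])
df = pd.DataFrame({"temp": [10.0, 11.0, 12.0]}, index=idx)

# True marks every repeat of an index label after its first occurrence.
mask = df.index.duplicated(keep="first")

# Inverting the mask keeps only the first row seen for each label.
deduped = df[~mask]
print(mask)
print(deduped)
```

The row order of the original frame is preserved, since this is plain boolean indexing rather than a grouping operation.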

Using the sample data provided:
```
>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
```
Note that you can keep the last element by changing the keep argument to 'last'.
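As a rough illustration of the keep argument, the sketch below runs the same mask with 'first', 'last', and False on a small made-up frame (the value column and the string labels are assumptions for the example). Passing keep=False marks every occurrence of a repeated label, so all rows sharing that label are dropped.

```python
import pandas as pd

# Hypothetical frame in which the label "b" appears twice.
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "b", "c"])

# Keep the first row of each repeated label.
first = df[~df.index.duplicated(keep="first")]

# Keep the last row of each repeated label.
last = df[~df.index.duplicated(keep="last")]

# Drop every row whose label occurs more than once.
unique_only = df[~df.index.duplicated(keep=False)]

print(first, last, unique_only, sep="\n\n")
```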

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):
```
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
```
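Since df1 from the other answer is not reproduced here, the following is a small stand-in sketch with a two-level index. The level names station and obs and the values are hypothetical. A row counts as a duplicate only when the full tuple of index levels repeats.

```python
import pandas as pd

# Hypothetical two-level index; the tuple ("s1", 1) appears twice.
index = pd.MultiIndex.from_tuples(
    [("s1", 1), ("s1", 1), ("s1", 2), ("s2", 1)],
    names=["station", "obs"],
)
df = pd.DataFrame({"value": [10, 11, 12, 13]}, index=index)

# Keep the last row for each complete (station, obs) combination.
deduped = df[~df.index.duplicated(keep="last")]
print(deduped)
```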