Pandas: Drop all records of duplicate indices


Question


I have a dataset that potentially contains duplicate records for the identifier appkey. The duplicated records should ideally not exist, so I take them to be data collection mistakes. I need to drop all rows whose appkey occurs more than once.

The drop_duplicates method is not useful in this case (or is it?), since it keeps either the first or the last of the duplicates. Is there an obvious idiom to achieve this with pandas?
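
For illustration, the default drop_duplicates behaviour the question refers to keeps one representative row per key rather than removing the key entirely; a minimal sketch with a toy appkey column:

import pandas as pd

df = pd.DataFrame({'appkey': [1, 1, 5], 'B': [2, 4, 6]})

# Default behaviour keeps the first occurrence of each appkey,
# so the duplicated key 1 still appears in the result.
df.drop_duplicates(subset=['appkey'])
#    appkey  B
# 0       1  2
# 2       5  6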


Answer 1:


As of pandas version 0.12, we have filter for this. It does exactly what @Andy's transform-based solution does, but a little more succinctly and somewhat faster.

df.groupby('AppKey').filter(lambda x: x.count() == 1)

To steal @Andy's example,

In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

In [2]: df.groupby('AppKey').filter(lambda x: x.count() == 1)
Out[2]: 
   AppKey  B
2       5  6
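
On more recent pandas versions, the callable passed to filter generally has to return a single boolean (a Series such as x.count() == 1 is rejected), so the group-size test is usually written with len. A minimal sketch of that variant:

import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

# Keep only groups that contain exactly one row; len(x) is a scalar,
# which newer filter implementations expect from the predicate.
df.groupby('AppKey').filter(lambda x: len(x) == 1)
#    AppKey  B
# 2       5  6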



Answer 2:


Here's one way, using a transform with count:

In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

In [2]: df
Out[2]:
   AppKey  B
0       1  2
1       1  4
2       5  6

Grouping by the AppKey column and applying a count transform assigns to each row the number of times its AppKey value occurs:

In [3]: count_appkey = df.groupby('AppKey')['AppKey'].transform('count')

In [4]: count_appkey
Out[4]:
0    2
1    2
2    1
Name: AppKey, dtype: int64

In [5]: count_appkey == 1
Out[5]:
0    False
1    False
2     True
Name: AppKey, dtype: bool

You can then use this as a boolean mask on the original DataFrame, leaving only those rows whose AppKey occurs precisely once:

In [6]: df[count_appkey == 1]
Out[6]:
   AppKey  B
2       5  6
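
The same steps can be collapsed into a single expression, which is just the transform and the boolean mask from above written inline:

# one-liner form of the transform approach shown above
df[df.groupby('AppKey')['AppKey'].transform('count') == 1]
#    AppKey  B
# 2       5  6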



Answer 3:


Since pandas version 0.17, the drop_duplicates function has a keep parameter that can be set to the boolean False to keep no duplicated entries at all (the other options are keep='first' and keep='last'). So, in this case:

df.drop_duplicates(subset=['appkey'], keep=False)
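
Applied to the toy DataFrame used in the other answers (column name AppKey assumed), this might look like:

import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

# keep=False drops every row whose AppKey value appears more than once
df.drop_duplicates(subset=['AppKey'], keep=False)
#    AppKey  B
# 2       5  6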



Answer 4:


The following solution, using set operations, works for me. It is significantly faster, though slightly more verbose, than the filter solution:

In [1]: import pandas as pd
In [2]: def dropalldups(df, key):
   ...:     first = df.duplicated(key)  # really all *but* first
   ...:     last = df.duplicated(key, take_last=True)
   ...:     return df.reindex(df.index - df[first | last].index)
   ...: 
In [3]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [4]: dropalldups(df, 'AppKey')
Out[4]: 
   AppKey  B
2       5  6

[1 rows x 2 columns]
In [5]: %timeit dropalldups(df, 'AppKey')
1000 loops, best of 3: 379 µs per loop
In [6]: %timeit df.groupby('AppKey').filter(lambda x: x.count() == 1)
1000 loops, best of 3: 1.57 ms per loop

On larger datasets, the performance difference is much more dramatic. Here are results for a DataFrame with 44k rows, where the column being filtered on is a 6-character string containing 870 occurrences of 560 duplicate values:

In [94]: %timeit dropalldups(eq, 'id')
10 loops, best of 3: 26.1 ms per loop
In [95]: %timeit eq.groupby('id').filter(lambda x: x.count() == 1)
1 loops, best of 3: 13.1 s per loop
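
Note that take_last was later replaced by keep='last' and subtracting one Index from another with - was removed, so on current pandas the same idea can be sketched more directly with a single duplicated(keep=False) mask (the timings above were not re-run for this variant):

import pandas as pd

def dropalldups(df, key):
    # duplicated(keep=False) marks every row whose key value occurs more
    # than once, i.e. the union of the "all but first" and "all but last"
    # masks used in the original function
    return df[~df.duplicated(key, keep=False)]

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
dropalldups(df, 'AppKey')
#    AppKey  B
# 2       5  6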


Source: https://stackoverflow.com/questions/18851216/pandas-drop-all-records-of-duplicate-indices
