Question
I have a dataset with potentially duplicate records of the identifier appkey. The duplicated records should ideally not exist and therefore I take them to be data collection mistakes. I need to drop all instances of an appkey which occurs more than once.
The drop_duplicates method is not useful in this case (or is it?) as it either selects the first or the last of the duplicates. Is there any obvious idiom to achieve this with pandas?
Answer 1:
As of pandas version 0.12, we have filter for this. It does exactly what @Andy's solution does using transform, but a little more succinctly and somewhat faster.
df.groupby('AppKey').filter(lambda x: len(x) == 1)
To steal @Andy's example,
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [2]: df.groupby('AppKey').filter(lambda x: len(x) == 1)
Out[2]:
   AppKey  B
2       5  6
Answer 2:
Here's one way, using a transform with count:
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [2]: df
Out[2]:
   AppKey  B
0       1  2
1       1  4
2       5  6
Grouping by the AppKey column and applying a count transform means that each occurrence of AppKey is counted, and that count is assigned to the rows where it appears:
In [3]: count_appkey = df.groupby('AppKey')['AppKey'].transform('count')
In [4]: count_appkey
Out[4]:
0    2
1    2
2    1
Name: AppKey, dtype: int64
In [5]: count_appkey == 1
Out[5]:
0    False
1    False
2     True
Name: AppKey, dtype: bool
You can then use this as a boolean mask on the original DataFrame (leaving only those rows whose AppKey occurs precisely once):
In [6]: df[count_appkey == 1]
Out[6]:
   AppKey  B
2       5  6
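If you don't need the intermediate Series, the same idea collapses into a single expression; a minimal sketch, equivalent to the steps above:
In [7]: df[df.groupby('AppKey')['AppKey'].transform('count') == 1]
Out[7]:
   AppKey  B
2       5  6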
Answer 3:
In pandas version 0.17, the drop_duplicates function gained a keep parameter that can be set to False (the boolean, not the string 'False') to keep no duplicated entries at all; the other options are keep='first' and keep='last'. So, in this case:
df.drop_duplicates(subset=['appkey'], keep=False)
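For illustration, a quick sketch of the three keep options on the example DataFrame used in the other answers (note the column there is named AppKey rather than appkey):
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

df.drop_duplicates(subset=['AppKey'], keep='first')  # keeps the first occurrence of each AppKey (rows 0 and 2)
df.drop_duplicates(subset=['AppKey'], keep='last')   # keeps the last occurrence of each AppKey (rows 1 and 2)
df.drop_duplicates(subset=['AppKey'], keep=False)    # drops every row whose AppKey is duplicated (row 2 only)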
Answer 4:
The following solution using set operations works for me. It is significantly faster, though slightly more verbose, than the filter solution:
In [1]: import pandas as pd
In [2]: def dropalldups(df, key):
   ...:     first = df.duplicated(key)              # really all *but* the first occurrence
   ...:     last = df.duplicated(key, keep='last')  # all but the last occurrence
   ...:     return df.reindex(df.index.difference(df[first | last].index))
   ...:
In [3]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [4]: dropalldups(df, 'AppKey')
Out[4]:
   AppKey  B
2       5  6
[1 rows x 2 columns]
In [5]: %timeit dropalldups(df, 'AppKey')
1000 loops, best of 3: 379 µs per loop
In [6]: %timeit df.groupby('AppKey').filter(lambda x: x.count() == 1)
1000 loops, best of 3: 1.57 ms per loop
On larger datasets, the performance difference is much more dramatic. Here are results for a DataFrame which has 44k rows. The column I'm filtering on is a 6-character string. There are 870 occurrences of 560 duplicate values:
In [94]: %timeit dropalldups(eq, 'id')
10 loops, best of 3: 26.1 ms per loop
In [95]: %timeit eq.groupby('id').filter(lambda x: x.count() == 1)
1 loops, best of 3: 13.1 s per loop
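The benchmark dataset itself isn't included, but here is a rough sketch of how one might set up a comparable test with synthetic data; the column names, sizes, and random keys are illustrative assumptions, not the original data:
import numpy as np
import pandas as pd

# Build a synthetic frame of ~44k rows keyed by 6-character strings,
# re-using some keys so that duplicates exist.
rng = np.random.default_rng(0)
letters = list('abcdefghijklmnopqrstuvwxyz')
keys = [''.join(rng.choice(letters, 6)) for _ in range(40000)]
dupes = list(rng.choice(keys, 4000))
eq = pd.DataFrame({'id': keys + dupes, 'val': rng.random(44000)})

# All three approaches should keep exactly the same rows.
a = dropalldups(eq, 'id')
b = eq.groupby('id').filter(lambda x: len(x) == 1)
c = eq.drop_duplicates(subset=['id'], keep=False)
assert len(a) == len(b) == len(c)
On recent pandas versions, the drop_duplicates(..., keep=False) form from answer 3 typically covers the same need with less code.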
Source: https://stackoverflow.com/questions/18851216/pandas-drop-all-records-of-duplicate-indices