Question
I have a dataset with potentially duplicate records of the identifier appkey. The duplicated records should ideally not exist and therefore I take them to be data collection mistakes. I need to drop all instances of an appkey which occurs more than once.
The drop_duplicates method is not useful in this case (or is it?) as it either selects the first or the last of the duplicates. Is there any obvious idiom to achieve this with pandas?
Answer 1:
As of pandas version 0.12, we have filter for this. It does exactly what @Andy's solution does using transform, but a little more succinctly and somewhat faster.
df.groupby('AppKey').filter(lambda x: len(x) == 1)
To steal @Andy's example,
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [2]: df.groupby('AppKey').filter(lambda x: len(x) == 1)
Out[2]:
   AppKey  B
2       5  6
Answer 2:
Here's one way, using a transform with count:
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [2]: df
Out[2]:
   AppKey  B
0       1  2
1       1  4
2       5  6
Grouping by the AppKey column and applying a count transform means that each occurrence of AppKey is counted, and that count is assigned to the rows where it appears:
In [3]: count_appkey = df.groupby('AppKey')['AppKey'].transform('count')
In [4]: count_appkey
Out[4]:
0    2
1    2
2    1
Name: AppKey, dtype: int64
In [5]: count_appkey == 1
Out[5]:
0    False
1    False
2     True
Name: AppKey, dtype: bool
You can then use this as a boolean mask on the original DataFrame (leaving only those rows whose AppKey occurs precisely once):
In [6]: df[count_appkey == 1]
Out[6]:
   AppKey  B
2       5  6
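If you don't need the intermediate Series, the same idea collapses into a single expression; a minimal sketch, equivalent to the steps above:
In [7]: df[df.groupby('AppKey')['AppKey'].transform('count') == 1]
Out[7]:
   AppKey  B
2       5  6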
Answer 3:
In pandas version 0.17, the drop_duplicates function gained a keep parameter that can be set to False (the boolean, not the string 'False') to keep no duplicated entries at all; the other options are keep='first' and keep='last'. So, in this case:
df.drop_duplicates(subset=['appkey'], keep=False)
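For illustration, a quick sketch of the three keep options on the example DataFrame used in the other answers (note the column there is named AppKey rather than appkey):
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

df.drop_duplicates(subset=['AppKey'], keep='first')  # keeps the first occurrence of each AppKey (rows 0 and 2)
df.drop_duplicates(subset=['AppKey'], keep='last')   # keeps the last occurrence of each AppKey (rows 1 and 2)
df.drop_duplicates(subset=['AppKey'], keep=False)    # drops every row whose AppKey is duplicated (row 2 only)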
Answer 4:
The following solution using set operations works for me. It is significantly faster, though slightly more verbose, than the filter solution:
In [1]: import pandas as pd
In [2]: def dropalldups(df, key):
   ...:     first = df.duplicated(key)              # really all *but* the first occurrence
   ...:     last = df.duplicated(key, keep='last')  # all but the last occurrence
   ...:     return df.reindex(df.index.difference(df[first | last].index))
   ...:
In [3]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [4]: dropalldups(df, 'AppKey')
Out[4]:
   AppKey  B
2       5  6
[1 rows x 2 columns]
In [5]: %timeit dropalldups(df, 'AppKey')
1000 loops, best of 3: 379 µs per loop
In [6]: %timeit df.groupby('AppKey').filter(lambda x: x.count() == 1)
1000 loops, best of 3: 1.57 ms per loop
On larger datasets, the performance difference is much more dramatic. Here are results for a DataFrame which has 44k rows. The column I'm filtering on is a 6-character string. There are 870 occurrences of 560 duplicate values:
In [94]: %timeit dropalldups(eq, 'id')
10 loops, best of 3: 26.1 ms per loop
In [95]: %timeit eq.groupby('id').filter(lambda x: x.count() == 1)
1 loops, best of 3: 13.1 s per loop
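The benchmark dataset itself isn't included, but here is a rough sketch of how one might set up a comparable test with synthetic data; the column names, sizes, and random keys are illustrative assumptions, not the original data:
import numpy as np
import pandas as pd

# Build a synthetic frame of ~44k rows keyed by 6-character strings,
# re-using some keys so that duplicates exist.
rng = np.random.default_rng(0)
letters = list('abcdefghijklmnopqrstuvwxyz')
keys = [''.join(rng.choice(letters, 6)) for _ in range(40000)]
dupes = list(rng.choice(keys, 4000))
eq = pd.DataFrame({'id': keys + dupes, 'val': rng.random(44000)})

# All three approaches should keep exactly the same rows.
a = dropalldups(eq, 'id')
b = eq.groupby('id').filter(lambda x: len(x) == 1)
c = eq.drop_duplicates(subset=['id'], keep=False)
assert len(a) == len(b) == len(c)
On recent pandas versions, the drop_duplicates(..., keep=False) form from answer 3 typically covers the same need with less code.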
Source: https://stackoverflow.com/questions/18851216/pandas-drop-all-records-of-duplicate-indices