I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame
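The original DataFrame is not reproduced here. A hypothetical reconstruction consistent with the outputs in the answers below might look like this; rows 2 and 5 are assumed placeholders, chosen only so that their (key1, key2) pair is unique:

```python
import pandas as pd

# Hypothetical reconstruction: rows 0, 1, 3, 4, 6 and 7 match the outputs
# shown in the answers; rows 2 and 5 are assumed values whose (key1, key2)
# pair occurs only once, so they are never flagged as duplicates.
frame = pd.DataFrame({
    'key1': [1, 2, 3, 1, 2, 4, 2, 2],
    'key2': [2, 2, 1, 2, 2, 1, 2, 2],
    'data': [5, 6, 7, 6, 1, 9, 2, 8],
})
```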
Here's one possible solution to return all duplicated values in the two columns (i.e. rows 0, 1, 3, 4, 6, 7):
>>> key1_dups = frame.key1[frame.key1.duplicated()].values
>>> key2_dups = frame.key2[frame.key2.duplicated()].values
>>> frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
key1 key2 data
0 1 2 5
1 2 2 6
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
(Edit: actually, the df.duplicated(take_last=True) | df.duplicated() method in @Yoel's answer is neater, and also more robust: it tests for duplicated (key1, key2) pairs, whereas the isin approach above checks each column's duplicates independently.)
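As a sketch of why the pair-based check is preferable: checking each column's duplicates independently can flag a row whose (key1, key2) pair is actually unique. The small df below is a made-up example for illustration:

```python
import pandas as pd

# Made-up counter-example: key1 value 1 repeats, key2 value 4 repeats,
# but no (key1, key2) *pair* repeats.
df = pd.DataFrame({'key1': [1, 1, 2], 'key2': [3, 4, 4], 'data': [0, 1, 2]})

key1_dups = df.key1[df.key1.duplicated()].values   # duplicated key1 values
key2_dups = df.key2[df.key2.duplicated()].values   # duplicated key2 values

# Column-wise check: flags row 1, even though (1, 4) occurs only once.
col_wise = df[df.key1.isin(key1_dups) & df.key2.isin(key2_dups)]

# Pair-wise check: correctly finds no duplicated rows.
pair_wise = df[df.duplicated(['key1', 'key2'], keep=False)]
```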
To query the results of your groupby operation, you can use loc. For example:
>>> dups = frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
>>> grouped = dups.groupby(['key1','key2']).min()
>>> grouped
data
key1 key2
1 2 5
2 2 1
>>> grouped.loc[1, 2]
data 5
Name: (1, 2), dtype: int64
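On newer pandas versions, grouped.loc[1, 2] on a MultiIndexed frame can be ambiguous (row key vs. row-plus-column). Passing the key as an explicit tuple is safer; a minimal sketch, rebuilding the grouped result from scratch:

```python
import pandas as pd

# The grouped result from above, rebuilt directly for illustration.
grouped = pd.DataFrame(
    {'data': [5, 1]},
    index=pd.MultiIndex.from_tuples([(1, 2), (2, 2)], names=['key1', 'key2']),
)

row = grouped.loc[(1, 2)]             # the whole row as a Series
value = grouped.loc[(1, 2), 'data']   # a single scalar
```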
Alternatively, turn grouped back into a "normal-looking" DataFrame by resetting both indexes:
>>> grouped.reset_index(level=0).reset_index(level=0)
key2 key1 data
0 2 1 5
1 2 2 1
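A single reset_index() call with no arguments drops every index level at once, and also preserves the original key1, key2 column order; a sketch using the same grouped result:

```python
import pandas as pd

# The grouped result from above, rebuilt directly for illustration.
grouped = pd.DataFrame(
    {'data': [5, 1]},
    index=pd.MultiIndex.from_tuples([(1, 2), (2, 2)], names=['key1', 'key2']),
)

# reset_index() with no arguments moves all index levels back into
# columns in one call, keeping the key1, key2, data order.
flat = grouped.reset_index()
```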
EDIT for Pandas 0.17 or later:
The take_last argument of the duplicated() method was deprecated in favour of a new keep argument in Pandas 0.17. The correct approach is now to call duplicated() with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False). Therefore, in order to extract the required data for this specific question, the following suffices:
In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
Out[81]:
data
key1 key2
1 2 5
2 2 1
[2 rows x 1 columns]
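Since the question asks for descriptive statistics in general, the same keep=False mask combines with groupby().agg() (or .describe()) just as well as with .min(). A sketch, using the hypothetical reconstruction of the frame (rows 2 and 5 are assumed values):

```python
import pandas as pd

# Hypothetical reconstruction of the example frame (rows 2 and 5 assumed).
frame = pd.DataFrame({
    'key1': [1, 2, 3, 1, 2, 4, 2, 2],
    'key2': [2, 2, 1, 2, 2, 1, 2, 2],
    'data': [5, 6, 7, 6, 1, 9, 2, 8],
})

# Keep only rows whose (key1, key2) pair is duplicated, then compute
# several statistics per group in one pass.
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]
stats = dups.groupby(['key1', 'key2'])['data'].agg(['count', 'min', 'mean', 'max'])
```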
Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.
For versions preceding Pandas 0.17:
We can play with the take_last argument of the duplicated() method:

take_last : boolean, default False
For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.
If we set take_last's value to True, we flag all but the last duplicate row. Combining this with its default value of False, which flags all but the first duplicate row, allows us to flag all duplicated rows:
In [76]: frame.duplicated(['key1', 'key2'])
Out[76]:
0 False
1 False
2 False
3 True
4 True
5 False
6 True
7 True
dtype: bool
In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]:
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 False
dtype: bool
In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]:
0 True
1 True
2 False
3 True
4 True
5 False
6 True
7 True
dtype: bool
In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]:
key1 key2 data
0 1 2 5
1 2 2 6
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
[6 rows x 3 columns]
Now we just need to use the groupby and min methods, and I believe the output is in the required format:
In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(['key1', 'key2']).min()
Out[81]:
data
key1 key2
1 2 5
2 2 1
[2 rows x 1 columns]
To get a list of all the duplicated entries with Pandas version 0.17 or later, you can simply set keep=False in the duplicated() method.
frame[frame.duplicated(['key1','key2'],keep=False)]
key1 key2 data
0 1 2 5
1 2 2 6
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
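For reference, the three keep settings select different subsets of the duplicated rows. A sketch, again using the hypothetical reconstruction of the frame (rows 2 and 5 are assumed values):

```python
import pandas as pd

# Hypothetical reconstruction of the example frame (rows 2 and 5 assumed).
frame = pd.DataFrame({
    'key1': [1, 2, 3, 1, 2, 4, 2, 2],
    'key2': [2, 2, 1, 2, 2, 1, 2, 2],
    'data': [5, 6, 7, 6, 1, 9, 2, 8],
})

cols = ['key1', 'key2']
all_dups   = frame[frame.duplicated(cols, keep=False)]    # every duplicated row
later_dups = frame[frame.duplicated(cols, keep='first')]  # all but the first of each group
earlier    = frame[frame.duplicated(cols, keep='last')]   # all but the last of each group
```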