How to analyze all duplicate entries in this Pandas DataFrame?

爱一瞬间的悲伤 2020-12-05 03:13

I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame

3 Answers
  • 2020-12-05 03:37

    Here's one possible solution to return all duplicated values in the two columns (i.e. rows 0, 1, 3, 4, 6, 7):

    >>> key1_dups = frame.key1[frame.key1.duplicated()].values
    >>> key2_dups = frame.key2[frame.key2.duplicated()].values
    >>> frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
       key1  key2  data
    0     1     2     5
    1     2     2     6
    3     1     2     6
    4     2     2     1
    6     2     2     2
    7     2     2     8
    

    (Edit: actually, the df.duplicated(take_last=True) | df.duplicated() method in @Yoel's answer is neater.)
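One caveat that may be worth checking on your own data: the snippet above tests each column separately, so it can also select rows whose individual values repeat even though the (key1, key2) pair itself does not. A minimal sketch, assuming a hypothetical four-row frame named `toy` (not from the question):

```python
import pandas as pd

# Hypothetical data: every value repeats within its own column,
# but no (key1, key2) pair occurs twice.
toy = pd.DataFrame({"key1": [1, 2, 1, 2],
                    "key2": [3, 3, 4, 4],
                    "data": [10, 20, 30, 40]})

key1_dups = toy.key1[toy.key1.duplicated()].values
key2_dups = toy.key2[toy.key2.duplicated()].values

# Column-by-column test: selects all four rows.
per_column = toy[toy.key1.isin(key1_dups) & toy.key2.isin(key2_dups)]

# Row-wise test on the pair (needs pandas 0.17 or later): selects nothing.
per_pair = toy[toy.duplicated(["key1", "key2"], keep=False)]

print(len(per_column), len(per_pair))
```

For the frame in the question both approaches happen to agree, which is probably why the difference did not show up there.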

    To query the results of your groupby operation, you can use loc. For example:

    >>> dups = frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
    >>> grouped = dups.groupby(['key1','key2']).min()
    >>> grouped
               data
    key1 key2      
    1    2        5
    2    2        1
    
    >>> grouped.loc[1, 2]
    data    5
    Name: (1, 2), dtype: int64
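A short sketch of the same lookup with the group key written as an explicit tuple, which avoids any ambiguity between index levels and columns. The frame is an assumption: rows 2 and 5 are hypothetical fillers, the other rows follow the output shown above, and `dups` is built here with `keep=False` rather than with the `isin` masks.

```python
import pandas as pd

# Assumed data: rows 2 and 5 are hypothetical; the rest matches the output above.
frame = pd.DataFrame({"key1": [1, 2, 3, 1, 2, 3, 2, 2],
                      "key2": [2, 2, 1, 2, 2, 4, 2, 2],
                      "data": [5, 6, 2, 6, 1, 6, 2, 8]})

dups = frame[frame.duplicated(["key1", "key2"], keep=False)]
grouped = dups.groupby(["key1", "key2"]).min()

# One group as a Series (all columns of that group).
row = grouped.loc[(1, 2)]

# One cell as a scalar: group (1, 2), column "data".
value = grouped.loc[(1, 2), "data"]
print(value)
```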
    

    Alternatively, turn grouped back into a "normal-looking" DataFrame by resetting both index levels:

    >>> grouped.reset_index(level=0).reset_index(level=0)
       key2  key1  data
    0     2     1     5
    1     2     2     1
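As far as I can tell, a single `reset_index()` call without arguments resets every level at once and keeps the columns in key1, key2 order; passing `as_index=False` to `groupby` should give the same flat result directly. A sketch under the same assumed frame (rows 2 and 5 are hypothetical fillers):

```python
import pandas as pd

# Assumed data: rows 2 and 5 are hypothetical; the rest matches the output above.
frame = pd.DataFrame({"key1": [1, 2, 3, 1, 2, 3, 2, 2],
                      "key2": [2, 2, 1, 2, 2, 4, 2, 2],
                      "data": [5, 6, 2, 6, 1, 6, 2, 8]})
dups = frame[frame.duplicated(["key1", "key2"], keep=False)]

# Option 1: group, then move all index levels back into columns.
flat = dups.groupby(["key1", "key2"]).min().reset_index()

# Option 2: never build the MultiIndex in the first place.
flat_direct = dups.groupby(["key1", "key2"], as_index=False).min()

print(flat)
```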
    
  • 2020-12-05 03:45

    EDIT for Pandas 0.17 or later:

    The take_last argument of the duplicated() method was deprecated in Pandas 0.17 in favour of the new keep argument, so please refer to this answer for the correct approach:

    • Invoke the duplicated() method with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False).

    Therefore, in order to extract the required data for this specific question, the following suffices:

    In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
    Out[81]: 
               data
    key1 key2      
    1    2        5
    2    2        1
    
    [2 rows x 1 columns]
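For anyone who wants to run this end to end, here is a self-contained sketch. The frame is an assumption: rows 2 and 5 are hypothetical fillers chosen so that they are not duplicated, and the remaining rows follow the output shown in this thread.

```python
import pandas as pd

# Assumed data: rows 2 and 5 are hypothetical; the rest matches the thread's output.
frame = pd.DataFrame({"key1": [1, 2, 3, 1, 2, 3, 2, 2],
                      "key2": [2, 2, 1, 2, 2, 4, 2, 2],
                      "data": [5, 6, 2, 6, 1, 6, 2, 8]})

# keep=False flags every member of a duplicated (key1, key2) group,
# including the first and the last occurrence.
mask = frame.duplicated(["key1", "key2"], keep=False)

# Pass the grouping keys as a list, not a tuple.
result = frame[mask].groupby(["key1", "key2"]).min()
print(result)
```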
    

    Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.


    For versions preceding Pandas 0.17:

    We can play with the take_last argument of the duplicated() method:

    take_last: boolean, default False

    For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.

    If we set take_last's value to True, we flag all but the last duplicate row. Combining this along with its default value of False, which flags all but the first duplicate row, allows us to flag all duplicated rows:

    In [76]: frame.duplicated(['key1', 'key2'])
    Out[76]: 
    0    False
    1    False
    2    False
    3     True
    4     True
    5    False
    6     True
    7     True
    dtype: bool
    
    In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
    Out[77]: 
    0     True
    1     True
    2    False
    3    False
    4     True
    5    False
    6     True
    7    False
    dtype: bool
    
    In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
    Out[78]: 
    0     True
    1     True
    2    False
    3     True
    4     True
    5    False
    6     True
    7     True
    dtype: bool
    
    In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
    Out[79]: 
       key1  key2  data
    0     1     2     5
    1     2     2     6
    3     1     2     6
    4     2     2     1
    6     2     2     2
    7     2     2     8
    
    [6 rows x 3 columns]
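On Pandas 0.17 or later, where `take_last=True` corresponds to `keep='last'`, the same union can be written with `keep` and compared against `keep=False`. A sketch, assuming a frame whose rows 2 and 5 are hypothetical fillers:

```python
import pandas as pd

# Assumed data: rows 2 and 5 are hypothetical; the rest matches the output above.
frame = pd.DataFrame({"key1": [1, 2, 3, 1, 2, 3, 2, 2],
                      "key2": [2, 2, 1, 2, 2, 4, 2, 2],
                      "data": [5, 6, 2, 6, 1, 6, 2, 8]})
keys = ["key1", "key2"]

first = frame.duplicated(keys, keep="first")  # flags all but the first occurrence
last = frame.duplicated(keys, keep="last")    # flags all but the last occurrence
either = first | last                         # flags every member of a duplicated group

# The union should match the single keep=False call.
print(either.equals(frame.duplicated(keys, keep=False)))
```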
    

    Now we just need to use the groupby and min methods, and I believe the output is in the required format:

    In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(['key1', 'key2']).min()
    Out[81]: 
               data
    key1 key2      
    1    2        5
    2    2        1
    
    [2 rows x 1 columns]
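Since the question asks for descriptive statistics in general, `min` can be swapped for several aggregations at once via `agg`. This is only a sketch: the frame is assumed (rows 2 and 5 are hypothetical fillers) and the choice of statistics is mine, not the asker's.

```python
import pandas as pd

# Assumed data: rows 2 and 5 are hypothetical; the rest matches the output above.
frame = pd.DataFrame({"key1": [1, 2, 3, 1, 2, 3, 2, 2],
                      "key2": [2, 2, 1, 2, 2, 4, 2, 2],
                      "data": [5, 6, 2, 6, 1, 6, 2, 8]})

dups = frame[frame.duplicated(["key1", "key2"], keep=False)]

# One row per duplicated key pair, one column per statistic.
stats = dups.groupby(["key1", "key2"])["data"].agg(["min", "max", "mean", "count"])
print(stats)
```

Calling `.describe()` on the grouped column instead should give the usual summary (count, mean, std, quartiles) per group.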
    
  • 2020-12-05 03:59

    To get all the duplicated entries with Pandas 0.17 or later, you can simply pass keep=False to the duplicated function.

    frame[frame.duplicated(['key1','key2'],keep=False)]
    
       key1  key2  data
    0     1     2     5
    1     2     2     6
    3     1     2     6
    4     2     2     1
    6     2     2     2
    7     2     2     8
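A different route to the same rows, in case the group size is also useful: broadcast each group's size back onto its rows with `transform` and keep the rows whose group has more than one member. This is an alternative technique, not what the answer above does, and the frame is assumed (rows 2 and 5 are hypothetical fillers).

```python
import pandas as pd

# Assumed data: rows 2 and 5 are hypothetical; the rest matches the output above.
frame = pd.DataFrame({"key1": [1, 2, 3, 1, 2, 3, 2, 2],
                      "key2": [2, 2, 1, 2, 2, 4, 2, 2],
                      "data": [5, 6, 2, 6, 1, 6, 2, 8]})

# Size of the (key1, key2) group each row belongs to, aligned with frame's index.
sizes = frame.groupby(["key1", "key2"])["data"].transform("size")

# Rows that share their key pair with at least one other row.
dups_alt = frame[sizes > 1]
print(dups_alt)
```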
    