How to find duplicate names using pandas?

無奈伤痛 2020-12-14 02:27

I have a pandas.DataFrame with a column called name containing strings. I would like to get a list of the names which occur more than once in the column.

6 answers
  • 2020-12-14 02:47

    I had a similar problem and came across this answer.

    I guess this also works:

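    counts = df.groupby('name').size()
    # Turn the counts into a one-column DataFrame named 'size'
    df2 = counts.to_frame(name='size')
    # Use df2['size'] here: df2.size is the DataFrame's element count, not the column
    df2 = df2[df2['size'] > 1]
    

    df2.index then holds the names that occur more than once.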
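
    The intermediate DataFrame is not strictly needed. A minimal sketch of the same groupby idea, filtering the counts Series directly (the sample data is made up for illustration):

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({'name': ['willy', 'willy', 'zoe', 'ann', 'ann', 'ann']})

# Series indexed by name, holding the number of rows per name.
counts = df.groupby('name').size()

# Keep only the names seen more than once and return them as a plain list.
dup_names = counts[counts > 1].index.tolist()
print(dup_names)
```

    groupby sorts the group keys by default, so the names come back in sorted order rather than in order of appearance.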

  • 2020-12-14 02:49

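    A short way to do it through the index:

    idx = x.set_index('name').index
    idx[idx.duplicated()].unique()
    

    Index.get_duplicates() used to do this in one call, but it was deprecated in pandas 0.23 and removed in 1.0. Series has a duplicated() method as well, so x['name'] can be filtered the same way without going through the index.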

  • 2020-12-14 02:49

    Most of the responses given demonstrate how to remove the duplicates, not find them.

    The following selects every row of the data frame whose 'name' value occurs more than once. With keep=False, all occurrences are returned, not just those after the first. keep='first' (the default) leaves the first occurrence unmarked, and keep='last' leaves the last one unmarked.

    df[df.duplicated(['name'], keep=False)]
    

    See the pandas reference for DataFrame.duplicated() for details.
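
    To make the difference between the keep values concrete, here is a small sketch on made-up data (the DataFrame contents are an assumption for illustration only):

```python
import pandas as pd

# Hypothetical sample data: 'willy' appears at index 0, 1 and 3.
df = pd.DataFrame({
    'name': ['willy', 'willy', 'zoe', 'willy'],
    'age': [10, 11, 10, 12],
})

# keep=False marks every row whose name is repeated anywhere.
all_rows = df[df.duplicated(['name'], keep=False)]

# keep='first' (the default) leaves the first occurrence unmarked.
after_first = df[df.duplicated(['name'], keep='first')]

# keep='last' leaves the last occurrence unmarked.
before_last = df[df.duplicated(['name'], keep='last')]

print(all_rows.index.tolist(), after_first.index.tolist(), before_last.index.tolist())
```

    If only the names are wanted rather than the full rows, calling unique() on the 'name' column of all_rows should give them.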

  • 2020-12-14 02:54

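    Another one-liner. Note that drop_duplicates() on its own only returns each distinct name once, so filter with duplicated() first to keep just the names that repeat:

    df.name[df.name.duplicated()].drop_duplicates()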
    
  • 2020-12-14 02:55

    If you want to find the rows with a duplicated name (excluding the first occurrence of each name), you can try this:

    In [16]: import pandas as pd
    In [17]: p1 = {'name': 'willy', 'age': 10}
    In [18]: p2 = {'name': 'willy', 'age': 11}
    In [19]: p3 = {'name': 'zoe', 'age': 10}
    In [20]: df = pd.DataFrame([p1, p2, p3])
    
    In [21]: df
    Out[21]: 
       age   name
    0   10  willy
    1   11  willy
    2   10    zoe
    
    In [22]: df.duplicated('name')
    Out[22]: 
    0    False
    1     True
    2    False
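
    The session above stops at the boolean mask. A minimal sketch of how that mask might be used to pull out the rows and the names, reusing the same three records:

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'willy', 'age': 10},
    {'name': 'willy', 'age': 11},
    {'name': 'zoe', 'age': 10},
])

# True for each row whose name was already seen in an earlier row.
mask = df.duplicated('name')

# Rows that repeat an earlier name.
repeat_rows = df[mask]

# The distinct names that occur more than once.
dup_names = df.loc[mask, 'name'].unique().tolist()
print(dup_names)
```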
    
  • 2020-12-14 03:04

    value_counts also tells you how many times each name occurs:

    names = df.name.value_counts()
    names[names > 1]
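
    A self-contained version of the same idea, on made-up data, in case it helps to see what comes back (the sample names are an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({'name': ['willy', 'willy', 'zoe', 'ann', 'ann', 'ann']})

# Occurrences per name, most frequent first.
names = df['name'].value_counts()

# Keep only the names that occur more than once.
repeated = names[names > 1]

print(repeated.to_dict())
print(repeated.index.tolist())
```

    Bracket access (df['name']) is used here instead of df.name; both select the column, but brackets also work for column names that collide with DataFrame attributes.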
    