(pandas) Drop duplicates based on subset where order doesn't matter

情歌与酒 asked 2020-12-11 21:49

What is the proper way to go from this df:

>>> df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})
>>> df
      a     b
0  jeff   bob
1   bob  jeff
2  jill  mike


        
2 Answers
  • 2020-12-11 22:40

    I think the simplest approach is to use apply with axis=1 to sort each row, then call DataFrame.duplicated:

    df = df[~df.apply(sorted, axis=1).duplicated()]
    print(df)
          a     b
    0  jeff   bob
    2  jill  mike
    

    A bit more involved, but much faster, is to use numpy.sort with the DataFrame constructor:

    import numpy as np

    df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
    df = df[~df1.duplicated()]
    print(df)
          a     b
    0  jeff   bob
    2  jill  mike
    

    Timings:

    np.random.seed(123)
    N = 10000
    df = pd.DataFrame({'A': np.random.randint(100,size=N).astype(str),
                       'B': np.random.randint(100,size=N).astype(str)})
    #print (df)
    
    In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
    100 loops, best of 3: 3.25 ms per loop
    
    In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
    1 loop, best of 3: 1.09 s per loop
    
    #Ted Petrou solution1
    In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
    1 loop, best of 3: 2.89 s per loop
    
    #Ted Petrou solution2
    In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
    1 loop, best of 3: 1.56 s per loop
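
    Put together, the fastest variant above can be run as a self-contained snippet on the question's sample data:

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

    # Sort each row's values so ('jeff', 'bob') and ('bob', 'jeff') produce the
    # same key, then keep only the first occurrence of each key.
    key = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
    result = df[~key.duplicated()]
    print(result)
    #       a     b
    # 0  jeff   bob
    # 2  jill  mike
    ```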
    
  • 2020-12-11 22:44

    I think you can sort each row independently and then use duplicated to see which ones to drop.

    dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
    df[~dupes]
    

    A faster way to compute dupes (thanks to @DSM):

    dupes = df.T.apply(sorted).T.duplicated()
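
    Another order-insensitive row key, not from either answer, is a frozenset per row. Note this is only a sketch: a frozenset also ignores multiplicity, so a row like ('bob', 'bob') collapses to the single-element set {'bob'}; it matches the sorted-row approaches only when values within a row are distinct, as in the sample data.

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

    # frozenset ignores order (and multiplicity), and is hashable, so
    # Series.duplicated can use it directly as a row key.
    key = df.apply(frozenset, axis=1)
    deduped = df[~key.duplicated()]
    print(deduped)
    ```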
    