What is the proper way to go from this df:
>>> df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})
>>>
I think the simplest way is to use apply with axis=1 to sort the values
within each row, and then call DataFrame.duplicated:
df = df[~df.apply(sorted, 1).duplicated()]
print(df)
      a     b
0  jeff   bob
2  jill  mike
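Depending on the pandas version, apply with axis=1 and a function that returns a list may give back a Series of lists rather than a DataFrame. Lists are not hashable, so duplicated can then raise a TypeError. A minimal sketch that avoids this, assuming it is acceptable to build one tuple per row (the name row_keys is only illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'],
                   'b': ['bob', 'jeff', 'mike']})

# Build one hashable, order-independent key per row:
# ('jeff', 'bob') and ('bob', 'jeff') both become ('bob', 'jeff').
row_keys = df.apply(lambda row: tuple(sorted(row)), axis=1)

# Keep only the first row seen for each key.
result = df[~row_keys.duplicated()]
print(result)
```

On the sample frame this keeps the rows with index 0 and 2, the same rows as the output shown above.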
A slightly more complicated, but much faster, approach is to use numpy.sort
with the DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
df = df[~df1.duplicated()]
print(df)
      a     b
0  jeff   bob
2  jill  mike
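If the frame has other columns that should not take part in the comparison, the same numpy.sort idea can be wrapped in a small helper. This is only a sketch: the function name drop_unordered_duplicates, its cols parameter, and the extra score column are hypothetical and do not appear in the question.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(frame, cols):
    """Drop rows whose values in `cols` repeat an earlier row in any order."""
    # Sort the values within each row so that ('jeff', 'bob') and
    # ('bob', 'jeff') end up identical.
    normalized = pd.DataFrame(np.sort(frame[cols].to_numpy(), axis=1),
                              index=frame.index)
    # duplicated() marks every repeat after the first occurrence.
    return frame[~normalized.duplicated()]


# 'score' is a made-up extra column that is ignored by the comparison.
df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'],
                   'b': ['bob', 'jeff', 'mike'],
                   'score': [1, 2, 3]})
out = drop_unordered_duplicates(df, ['a', 'b'])
print(out)
```

Only the columns listed in cols are sorted and compared, so the remaining columns come through unchanged for the rows that are kept.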
Timings:
np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100, size=N).astype(str),
                   'B': np.random.randint(100, size=N).astype(str)})
#print (df)
In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
100 loops, best of 3: 3.25 ms per loop
In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
1 loop, best of 3: 1.09 s per loop
#Ted Petrou solution1
In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
1 loop, best of 3: 2.89 s per loop
#Ted Petrou solution2
In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
1 loop, best of 3: 1.56 s per loop
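The %timeit lines above need IPython. To repeat a comparison from a plain script, something like the following sketch could be used. It assumes the standard timeit module, and the function names are illustrative. Note that with_apply uses a tuple-per-row variant rather than the exact apply(sorted, 1) expression timed above, so its numbers are not directly comparable with those results.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100, size=N).astype(str),
                   'B': np.random.randint(100, size=N).astype(str)})


def with_numpy_sort(frame):
    # Vectorized: sort each row in NumPy, then look for duplicate rows.
    normalized = pd.DataFrame(np.sort(frame.to_numpy(), axis=1),
                              index=frame.index, columns=frame.columns)
    return frame[~normalized.duplicated()]


def with_apply(frame):
    # Row-wise Python loop: one sorted tuple per row.
    keys = frame.apply(lambda row: tuple(sorted(row)), axis=1)
    return frame[~keys.duplicated()]


# Both approaches should keep exactly the same rows.
same = with_numpy_sort(df).equals(with_apply(df))
print('same result:', same)

for func in (with_numpy_sort, with_apply):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f'{func.__name__}: {seconds:.4f} s per call')
```

Absolute timings will vary with hardware and library versions, so only the relative difference between the two functions is meaningful.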
I think you can sort each row independently and then use duplicated to see which ones to drop.
dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]
A faster way to get dupes, thanks to @DSM:
dupes = df.T.apply(sorted).T.duplicated()
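Which rows get dropped depends on the keep argument of duplicated. The sketch below shows the three options on the sample frame. It builds the normalized frame with numpy.sort rather than the transposed apply, purely as an assumption for illustration, and the variable names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'],
                   'b': ['bob', 'jeff', 'mike']})

# Row-wise sorted copy: rows 0 and 1 both become ('bob', 'jeff').
normalized = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# keep='first' (the default) drops later repeats: rows 0 and 2 remain.
first_kept = df[~normalized.duplicated(keep='first')]

# keep='last' drops earlier repeats instead: rows 1 and 2 remain.
last_kept = df[~normalized.duplicated(keep='last')]

# keep=False drops every row that has a counterpart: only row 2 remains.
none_kept = df[~normalized.duplicated(keep=False)]

print(first_kept.index.tolist(), last_kept.index.tolist(), none_kept.index.tolist())
```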