Python - Delete duplicates in a dataframe based on two columns combinations?


名媛妹妹 2020-11-29 10:53

I have a dataframe with 3 columns in Python:

Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1

and would like to eliminate the duplicates based

3 answers

北荒 (original poster)

2020-11-29 11:45

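I know I'm kind of late to this question, but here is my contribution anyway :)

You can also use get_dummies and add to build an encoding of each row that ignores the order of the two columns (a and b here stand for the two name columns), and then pass it to duplicated: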

df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
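To see why this works, here is a minimal sketch on a toy frame. The sample values are assumed for illustration, and the column names a and b are chosen to match the one-liner above. Each row of the summed dummies counts how often each name appears in that row, so (Juan, Ale) and (Ale, Juan) end up with the same encoding:

```python
import pandas as pd

# Toy data (assumed for illustration): rows 0 and 1 hold the same pair, swapped.
df = pd.DataFrame({
    "a": ["Juan", "Ale", "Juan"],
    "b": ["Ale", "Juan", "Pedro"],
    "Value": [1, 1, 2],
})

# One indicator column per name. Casting to int makes the sum a count,
# since recent pandas versions return booleans from get_dummies.
counts = pd.get_dummies(df["a"]).astype(int).add(
    pd.get_dummies(df["b"]).astype(int), fill_value=0
)

# Rows with identical counts contain the same names, whatever the order.
deduped = df[~counts.duplicated()]
print(deduped)
```

Only the first occurrence of each unordered pair is kept, which is the default behaviour of duplicated.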

The timings are not as good as @Wen's answer, but it is still much faster than apply + frozenset:

df=pd.concat([df]*1000000)
%timeit df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
1.8 s ± 85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~pd.DataFrame(np.sort(df[['a','b']].values,1)).duplicated()]
1.26 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~df[['a', 'b']].apply(frozenset, axis=1).duplicated()]
1min 9s ± 684 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
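The two other approaches timed above can be checked against each other on a small frame. This is only a sketch: the toy data is assumed, and the explicit index=df.index is an addition here so that the boolean mask lines up with df.

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration).
df = pd.DataFrame({
    "a": ["Juan", "Ale", "Juan"],
    "b": ["Ale", "Juan", "Pedro"],
    "Value": [1, 1, 2],
})

# Sort the two names within each row so that swapped pairs become identical.
sorted_pairs = pd.DataFrame(
    np.sort(df[["a", "b"]].to_numpy(), axis=1), index=df.index
)
by_sort = df[~sorted_pairs.duplicated()]

# Build an order-independent key per row with frozenset (slow on large frames).
by_frozenset = df[~df[["a", "b"]].apply(frozenset, axis=1).duplicated()]

# Both masks should keep the same rows.
print(by_sort.equals(by_frozenset))
```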
