Remove duplicate rows from Pandas dataframe where only some columns have the same value

断了今生、忘了曾经 submitted on 2019-12-17 02:38:12

Question


I have a pandas dataframe as follows:

A   B   C
1   2   x
1   2   y
3   4   z
3   5   x

I want only one row to remain among rows that share the same values in specific columns, here columns A and B. In other words, if a combination of values in A and B occurs more than once in the dataframe, only one of those rows should be kept (it does not matter which one).

FWIW: the maximum number of so-called duplicate rows (that is, rows where columns A and B are the same) is 2.

The result should look like this:

A   B   C
1   2   x
3   4   z
3   5   x

or

A   B   C
1   2   y
3   4   z
3   5   x

Answer 1:


Use drop_duplicates with the subset parameter. By default the first row of each duplicate group is kept; to keep the last one instead, add keep='last':

df1 = df.drop_duplicates(subset=['A','B'])
# same as
# df1 = df.drop_duplicates(subset=['A','B'], keep='first')
print(df1)
   A  B  C
0  1  2  x
2  3  4  z
3  3  5  x

df2 = df.drop_duplicates(subset=['A','B'], keep='last')
print(df2)
   A  B  C
1  1  2  y
2  3  4  z
3  3  5  x
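
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the example dataframe from the question and applies the same drop_duplicates calls. The variable names dup_mask, first_kept and last_kept are only illustrative, and the duplicated(...) and reset_index(drop=True) steps are optional extras that are not part of the original answer: the first lets you inspect which rows count as duplicates, the second renumbers the index after rows are dropped.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": [1, 1, 3, 3],
    "B": [2, 2, 4, 5],
    "C": ["x", "y", "z", "x"],
})

# Boolean mask marking every row whose (A, B) pair occurs more than once.
dup_mask = df.duplicated(subset=["A", "B"], keep=False)

# Keep the first row of each (A, B) group and renumber the index from 0.
first_kept = df.drop_duplicates(subset=["A", "B"]).reset_index(drop=True)

# Keep the last row of each (A, B) group instead.
last_kept = df.drop_duplicates(subset=["A", "B"], keep="last").reset_index(drop=True)

print(df[dup_mask])   # the rows that share their (A, B) values
print(first_kept)
print(last_kept)
```

Since the question says it does not matter which row survives, either variant should do; the default (keep='first') is the shortest to write.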


Source: https://stackoverflow.com/questions/44481768/remove-duplicate-rows-from-pandas-dataframe-where-only-some-columns-have-the-sam
