Drop duplicates, but ignore nulls

Posted by 戏子无情 on 2021-02-07 13:26:58

Question


So I know you can use something like this to drop duplicate rows:

the_data.drop_duplicates(subset=['the_key'])

However, if the_key is null for some values, like below:

   the_key  C  D
1      NaN  *  *
2      NaN     *
3      111  *  *
4      111

By default, drop_duplicates keeps only the rows marked in the C column, because it treats all NaN keys as duplicates of one another. Is it possible to make drop_duplicates treat every NaN as distinct, so that the output keeps the rows marked in the D column?
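
For reference, the behaviour described above can be reproduced with a small sketch. The frame below is hypothetical: it mirrors the key column from the table, and the `value` column is an assumption added only to make the rows distinguishable.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the key column of the table above.
the_data = pd.DataFrame(
    {"the_key": [np.nan, np.nan, 111, 111], "value": ["a", "b", "c", "d"]},
    index=[1, 2, 3, 4],
)

# drop_duplicates treats NaN keys as equal to each other,
# so the second NaN row (index 2) is dropped along with index 4.
default_result = the_data.drop_duplicates(subset=["the_key"])
print(default_result.index.tolist())  # [1, 3]
```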


Answer 1:


Use duplicated combined with isna, then filter by boolean indexing:

df = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
# for older pandas versions
#df = df[(~df['the_key'].duplicated()) | df['the_key'].isnull()]
print(df)
   the_key  C    D
1      NaN  *    *
2      NaN       * 
3    111.0  *    *
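
The one-liner above can be unpacked into its two masks. This is a minimal self-contained sketch, assuming a hypothetical frame with the same keys as the question (the `value` column and the names `is_first` and `is_null` are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same keys as in the question.
df = pd.DataFrame(
    {"the_key": [np.nan, np.nan, 111, 111], "value": ["a", "b", "c", "d"]},
    index=[1, 2, 3, 4],
)

# True for the first occurrence of each key (NaN counts as one key here).
is_first = ~df["the_key"].duplicated()
# True wherever the key is missing.
is_null = df["the_key"].isna()

# Keep a row if it is a first occurrence OR its key is null.
result = df[is_first | is_null]
print(result.index.tolist())  # [1, 2, 3]
```

Both NaN rows survive because the `is_null` mask overrides the duplicate flag for them, while the second 111 row is still dropped.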



Answer 2:


I'd do it this way:

dupes = the_data.duplicated(subset=['the_key'])
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
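
The same idea can be wrapped in a small helper. This is only a sketch: the function name `drop_duplicates_keep_nulls` and the sample frame are hypothetical, and it combines the masks with `&` instead of assigning into `dupes`, which should give the same result without mutating the mask.

```python
import numpy as np
import pandas as pd


def drop_duplicates_keep_nulls(frame, key):
    """Drop rows whose `key` value repeats, but never drop rows where `key` is null."""
    dupes = frame.duplicated(subset=[key])
    # A row only counts as a duplicate if its key is not null.
    dupes = dupes & frame[key].notna()
    return frame[~dupes]


# Hypothetical data with the same keys as in the question.
the_data = pd.DataFrame(
    {"the_key": [np.nan, np.nan, 111, 111], "value": ["a", "b", "c", "d"]},
    index=[1, 2, 3, 4],
)

deduped = drop_duplicates_keep_nulls(the_data, "the_key")
print(deduped.index.tolist())  # [1, 2, 3]
```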


Source: https://stackoverflow.com/questions/50154835/drop-duplicates-but-ignore-nulls
