How to drop rows that are not exact duplicates but contain no new information (more NaN)

Submitted by 不羁岁月 on 2020-01-25 09:17:06

Question


My goal is to collapse the table below into one single column. For this question specifically, I am asking how I can intelligently delete the yellow row, because it duplicates the gray row but carries less information.

The table has three categorical variables and 6 analysis/quantitative variables. Columns C1 and C2 are the only variables that need to match for a successful join. All blank cells are NaNs, and Python code for copying the data is below.

Question 1. (Yellow) All of the quantitative information stored in the yellow row is also stored in the gray row, and the gray row has more information. Is there a way to intelligently delete a row of this type, similar to the pandas drop_duplicates function? A hypothetical option would be df.drop_duplicates(subset=df.columns[4:], ignoreNaNs=True)

Related question (Blue): How to join two rows that have the same keys and complementary values

Data table (color-coded image not reproduced here)



Current Progress

My current code includes this line to drop all rows where all quantitative variables are NaN.
df.dropna(subset=df.columns[4:],how='all', inplace=True)

It also includes this line to delete all rows whose quantitative variables are all the same as those of an earlier row.
df.drop_duplicates(subset=df.columns[4:], inplace=True)
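
To make the gap concrete, here is a minimal sketch of what those two lines do and do not remove. The toy frame and its column names (key, a, b) are made up for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one key column followed by two quantitative columns.
toy = pd.DataFrame({
    "key": ["x", "x", "x", "x"],
    "a": [1.0, np.nan, 1.0, 1.0],
    "b": [2.0, np.nan, 2.0, np.nan],
})
quant = toy.columns[1:]

# Drops row 1, where every quantitative value is NaN.
step1 = toy.dropna(subset=quant, how="all")

# Drops row 2, whose quantitative values repeat row 0 exactly.
step2 = step1.drop_duplicates(subset=quant)

print(step2)
```

Row 3 survives both steps even though its only value (a = 1.0) is already present in row 0. That is the "yellow row" situation: drop_duplicates compares whole rows, so a row with an extra NaN is not treated as a duplicate.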

Example code that can be copied into an IDE.

import pandas as pd
from numpy import nan
from pandas import Timestamp

dfO = [['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,nan],
 ['S1','P3','H1',Timestamp('2004-12-20 00:00:00'),nan,nan,nan,nan,nan,nan],
 ['S1','P3','H2',Timestamp('2004-12-20 00:00:00'),-15.0,nan,nan,-10.0,nan,nan],
 ['S1','P3','H3',Timestamp('2004-12-07 00:00:00'),nan,nan,nan,nan,-15.0,-8.0],
 ['S1','P3','H1', Timestamp('2004-12-04 00:00:00'), -15.0,-27.4,nan,-10.0, -15.0, nan]]
cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
df = pd.DataFrame(data=dfO,columns=cols)

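# Drop exact duplicate rows.
df.drop_duplicates(inplace=True)
# Drop rows in which every quantitative column is NaN.
df.dropna(subset=df.columns[4:],how='all', inplace=True)
# Drop rows whose quantitative columns repeat an earlier row.
df.drop_duplicates(subset=df.columns[4:], inplace=True)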
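
One possible reading of the hypothetical ignoreNaNs option is a "subsumed row" filter: within each group of rows sharing C1 and C2, drop a row if another row holds the same value in every position where the first row is not NaN. The sketch below is only an illustration under that assumption. The helper name drop_subsumed_rows is invented here and is not a pandas function. It assumes a unique index, numeric quantitative columns, and pandas 1.1 or later (for the dropna argument of groupby). It compares rows pairwise, so it may be slow on large groups.

```python
import numpy as np
import pandas as pd


def drop_subsumed_rows(frame, key_cols, value_cols):
    """Drop rows whose non-NaN values all appear in another row with the same keys.

    Hypothetical helper, not part of pandas. Assumes a unique index and
    numeric value columns. Cost is quadratic in the size of each key group.
    """
    keep = []
    # dropna=False keeps groups whose key contains NaN.
    for _, group in frame.groupby(key_cols, sort=False, dropna=False):
        vals = group[value_cols].to_numpy(dtype=float)
        present = ~np.isnan(vals)
        for i in range(len(vals)):
            mask = present[i]
            subsumed = False
            for j in range(len(vals)):
                if i == j:
                    continue
                # Row j covers row i if it has the same value wherever row i has one.
                covers = present[j][mask].all() and np.array_equal(
                    vals[j][mask], vals[i][mask]
                )
                if not covers:
                    continue
                # Drop row i if row j holds strictly more values, or if the two
                # rows are identical and row j comes first.
                if present[j].sum() > mask.sum() or j < i:
                    subsumed = True
                    break
            if not subsumed:
                keep.append(group.index[i])
    return frame[frame.index.isin(keep)]


nan = np.nan
rows = [
    ["S1", "P3", "H1", pd.Timestamp("2004-12-04"), -15.0, -27.4, nan, -10.0, -15.0, nan],
    ["S1", "P3", "H1", pd.Timestamp("2004-12-20"), nan, nan, nan, nan, nan, nan],
    ["S1", "P3", "H2", pd.Timestamp("2004-12-20"), -15.0, nan, nan, -10.0, nan, nan],
    ["S1", "P3", "H3", pd.Timestamp("2004-12-07"), nan, nan, nan, nan, -15.0, -8.0],
    ["S1", "P3", "H1", pd.Timestamp("2004-12-04"), -15.0, -27.4, nan, -10.0, -15.0, nan],
]
cols = ["C1 (PK)", "C2 (FK)", "C3", "C4", "Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]
df = pd.DataFrame(rows, columns=cols)

result = drop_subsumed_rows(df, ["C1 (PK)", "C2 (FK)"], list(df.columns[4:]))
print(result)
```

On the example data this keeps rows 0 and 3. Row 1 (all NaN), row 2 (a subset of row 0), and row 4 (an exact repeat of row 0) are dropped. Rows 0 and 3 hold complementary values, so merging them belongs to the related question rather than to this filter.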

Source: https://stackoverflow.com/questions/59772372/how-to-drop-rows-that-are-not-exact-duplicates-but-contain-no-new-information-m
