Question
Using df.dropna(thresh=x, inplace=True), I can successfully drop the rows that lack at least x non-NaN values.
But because my df looks like:
          2001  2002  2003  2004
bob  A     123    31     4    12
bob  B      41     1    56    13
bob  C     nan   nan     4   nan
bill A     451     8   nan    24
bill B      32     5    52     6
bill C     623    12    41    14
#Repeating features (A,B,C) for each index/name
This drops the one row/instance where the thresh condition is met, but leaves the other instances of that feature.
What I want is something that drops the entire feature if the thresh condition is met for any one row, such as:
df.dropna(thresh = 2, inplace=True):
          2001  2002  2003  2004
bob  A     123    31     4    12
bob  B      41     1    56    13
bill A     451     8   nan    24
bill B      32     5    52     6
#Drops C from the whole df
wherein C is removed from the entire df, not just the one time it meets the condition under bob.
Answer 1:
Your sample looks like a MultiIndex dataframe where index level 1 is the feature (A, B, C) and index level 0 is the name. You can use notna and sum to create a mask that identifies rows where the number of non-NaN values is less than 2, and then get their index level 1 values. Finally, use df.query to filter out every row with one of those values (inside query, ilevel_1 refers to the unnamed index level 1):
a = df.notna().sum(1).lt(2).loc[lambda x: x].index.get_level_values(1)
df_final = df.query('ilevel_1 not in @a')
Out[275]:
         2001  2002  2003  2004
bob  A  123.0  31.0   4.0  12.0
     B   41.0   1.0  56.0  13.0
bill A  451.0   8.0   NaN  24.0
     B   32.0   5.0  52.0   6.0
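For readers who want to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and filters with isin instead of query, which is a swapped-in variant of the approach above rather than the answerer's exact code. The level names "name" and "feature" and the variable names are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data; the level names are assumed, not from the question.
index = pd.MultiIndex.from_product(
    [["bob", "bill"], ["A", "B", "C"]], names=["name", "feature"]
)
df = pd.DataFrame(
    [
        [123, 31, 4, 12],
        [41, 1, 56, 13],
        [np.nan, np.nan, 4, np.nan],
        [451, 8, np.nan, 24],
        [32, 5, 52, 6],
        [623, 12, 41, 14],
    ],
    index=index,
    columns=[2001, 2002, 2003, 2004],
)

thresh = 2

# Count the non-NaN values in each row.
non_nan_counts = df.notna().sum(axis=1)

# Features that fall below the threshold in at least one row.
bad_features = (
    non_nan_counts[non_nan_counts < thresh]
    .index.get_level_values("feature")
    .unique()
)

# Keep only rows whose feature never fell below the threshold.
df_final = df[~df.index.get_level_values("feature").isin(bad_features)]
print(df_final)
```

With the sample data, only bob C has fewer than 2 non-NaN values, so every C row is removed.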
Method 2:
Use notna, sum, groupby, and transform to create a mask that is True for every group (feature) in which all rows have at least 2 non-NaN values. Finally, use this mask to select rows:
m = df.notna().sum(1).groupby(level=1).transform(lambda x: x.ge(2).all())
df_final = df[m]
Out[296]:
         2001  2002  2003  2004
bob  A  123.0  31.0   4.0  12.0
     B   41.0   1.0  56.0  13.0
bill A  451.0   8.0   NaN  24.0
     B   32.0   5.0  52.0   6.0
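The groupby idea can also be wrapped in a small reusable helper. The sketch below is one possible variant, not the answerer's code: it takes the minimum per-row count within each feature group via transform("min") instead of a lambda, and the function name drop_features_below_thresh is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_features_below_thresh(df, thresh, level=1):
    """Drop every row of a group if any row in that group has fewer than
    `thresh` non-NaN values. (Hypothetical helper, for illustration only.)"""
    counts = df.notna().sum(axis=1)
    # Broadcast each group's smallest non-NaN count back onto its rows.
    group_min = counts.groupby(level=level).transform("min")
    return df[group_min >= thresh]


# Sample data from the question, with an unnamed two-level index.
index = pd.MultiIndex.from_product([["bob", "bill"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [123, 31, 4, 12],
        [41, 1, 56, 13],
        [np.nan, np.nan, 4, np.nan],
        [451, 8, np.nan, 24],
        [32, 5, 52, 6],
        [623, 12, 41, 14],
    ],
    index=index,
    columns=[2001, 2002, 2003, 2004],
)

result_2 = drop_features_below_thresh(df, thresh=2)  # removes feature C
result_4 = drop_features_below_thresh(df, thresh=4)  # removes features A and C
print(result_2)
print(result_4)
```

Comparing the group minimum against the threshold is equivalent to requiring that all rows in the group meet it, which is the condition the mask above expresses.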
Answer 2:
Keep only the rows with at least 5 non-NA values.
df.dropna(thresh=5)
thresh keeps only the rows that have at least that many non-NaN values.
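As a quick check of how thresh behaves on its own, the following sketch applies plain dropna to the sample data from the question. The variable names are assumptions for illustration. It shows that dropna evaluates each row independently, so other rows of the same feature are left in place.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bob", "bill"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [123, 31, 4, 12],
        [41, 1, 56, 13],
        [np.nan, np.nan, 4, np.nan],
        [451, 8, np.nan, 24],
        [32, 5, 52, 6],
        [623, 12, 41, 14],
    ],
    index=index,
    columns=[2001, 2002, 2003, 2004],
)

# Keep rows with at least 2 non-NaN values: only (bob, C) is dropped,
# while (bill, C) stays because it is judged on its own.
kept_rows = df.dropna(thresh=2)
print(kept_rows)

# With only 4 columns, no row can have 5 non-NaN values, so nothing survives.
kept_none = df.dropna(thresh=5)
print(len(kept_none))
```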
Source: https://stackoverflow.com/questions/59593901/python-drop-all-instances-of-feature-from-df-if-nan-thresh-is-met