Python Drop all instances of Feature from DF if NaN thresh is met

Submitted by 与世无争的帅哥 on 2020-01-23 13:00:11

Question


Using df.dropna(thresh=x, inplace=True), I can successfully drop the rows that have fewer than x non-NaN values.

But because my df looks like:

          2001     2002     2003    2004

bob   A   123      31       4        12
bob   B   41        1       56       13
bob   C   nan      nan      4        nan

bill  A   451      8        nan      24
bill  B   32       5        52        6
bill  C   623      12       41       14

#Repeating features (A,B,C) for each index/name

As a result, this drops only the one row/instance where the thresh condition is met, and leaves the other instances of that feature.

What I want is something that drops the entire feature if any one of its rows falls below the threshold, such as:

df.dropna(thresh = 2, inplace=True):

           2001     2002     2003    2004

bob    A    123      31       4        12
bob    B    41        1       56       13

bill   A    451      8        nan      24
bill   B    32       5        52        6

#Drops C from the whole df

Here C is removed from the entire df, not just the one instance under bob where it meets the condition.


Answer 1:


Your sample looks like a DataFrame with a MultiIndex, where index level 1 holds the features A, B, C and index level 0 holds the names. You can use notna and sum to build a mask identifying the rows with fewer than 2 non-NaN values, then get the index level 1 values of those rows. Finally, use df.query to filter out every row with one of those features (ilevel_1 is how query refers to an unnamed index level 1).

a = df.notna().sum(1).lt(2).loc[lambda x: x].index.get_level_values(1)
df_final = df.query('ilevel_1 not in @a')

Out[275]:
         2001  2002  2003  2004
bob  A  123.0  31.0   4.0  12.0
     B   41.0   1.0  56.0  13.0
bill A  451.0   8.0   NaN  24.0
     B   32.0   5.0  52.0   6.0
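
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same idea. The DataFrame construction is an assumption rebuilt from the question's sample (string column labels, unnamed index levels), and the final filter uses Index.isin instead of df.query, which should select the same rows without depending on the ilevel_1 name.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data.
index = pd.MultiIndex.from_product([["bob", "bill"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [123, 31, 4, 12],
        [41, 1, 56, 13],
        [np.nan, np.nan, 4, np.nan],
        [451, 8, np.nan, 24],
        [32, 5, 52, 6],
        [623, 12, 41, 14],
    ],
    index=index,
    columns=["2001", "2002", "2003", "2004"],
)

# Count non-NaN values per row, then collect the features (index level 1)
# of every row that has fewer than 2 of them.
non_na = df.notna().sum(axis=1)
bad_features = non_na[non_na < 2].index.get_level_values(1).unique()

# Keep only the rows whose feature is not in that set.
df_final = df[~df.index.get_level_values(1).isin(bad_features)]
print(df_final)
```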

Method 2:
Use notna, sum, groupby and transform to build a mask that is True for every group (feature) whose rows all have at least 2 non-NaN values. Finally, use this mask to select rows.

m = df.notna().sum(1).groupby(level=1).transform(lambda x: x.ge(2).all())
df_final = df[m]

Out[296]:
         2001  2002  2003  2004
bob  A  123.0  31.0   4.0  12.0
     B   41.0   1.0  56.0  13.0
bill A  451.0   8.0   NaN  24.0
     B   32.0   5.0  52.0   6.0
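
Method 2 generalizes easily to other thresholds. Below is a hedged sketch that wraps it in a helper; the function name drop_features_below_thresh and its parameters are hypothetical (they do not come from pandas or from the answer), and the sample data is again an assumed reconstruction of the question's table.

```python
import numpy as np
import pandas as pd


def drop_features_below_thresh(frame, thresh, level=1):
    """Hypothetical helper: drop every group on `level` in which at least
    one row has fewer than `thresh` non-NaN values."""
    # True for rows that individually satisfy the threshold.
    row_ok = frame.notna().sum(axis=1).ge(thresh)
    # Broadcast "all rows of this feature are ok" back onto every row.
    keep = row_ok.groupby(level=level).transform("all").astype(bool)
    return frame[keep]


# Assumed reconstruction of the question's sample data.
index = pd.MultiIndex.from_product([["bob", "bill"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [123, 31, 4, 12],
        [41, 1, 56, 13],
        [np.nan, np.nan, 4, np.nan],
        [451, 8, np.nan, 24],
        [32, 5, 52, 6],
        [623, 12, 41, 14],
    ],
    index=index,
    columns=["2001", "2002", "2003", "2004"],
)

print(drop_features_below_thresh(df, thresh=2))  # C is removed for both names
print(drop_features_below_thresh(df, thresh=4))  # only B has no NaN anywhere
```

Unlike dropna with inplace=True, this sketch returns a new DataFrame and leaves the original untouched.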



Answer 2:


Keep only the rows with at least 5 non-NA values.

df.dropna(thresh=5)

thresh is the minimum number of non-NaN values a row must have in order to be kept.
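
To make the row-wise behaviour of thresh concrete, here is a small sketch on an assumed reconstruction of the question's sample data. With thresh=2, plain dropna removes only the single row that has fewer than 2 non-NaN values and keeps the other C rows, which is the behaviour the question describes.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data.
index = pd.MultiIndex.from_product([["bob", "bill"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [123, 31, 4, 12],
        [41, 1, 56, 13],
        [np.nan, np.nan, 4, np.nan],
        [451, 8, np.nan, 24],
        [32, 5, 52, 6],
        [623, 12, 41, 14],
    ],
    index=index,
    columns=["2001", "2002", "2003", "2004"],
)

# Rows need at least 2 non-NaN values to survive; each row is judged alone.
row_wise = df.dropna(thresh=2)
print(row_wise)
```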



Source: https://stackoverflow.com/questions/59593901/python-drop-all-instances-of-feature-from-df-if-nan-thresh-is-met
