Retain a few NAs and drop the rest during a stack operation in Python

悲&欢浪女 2020-12-04 04:14

I have a DataFrame like the one shown below:

df2 = pd.DataFrame({'person_id':[1],'H1_date' : ['2006-10-30 00:00:00'], 'H1':[2.3],'H2_date' : ['2016-10-30


        
3 Answers
  • 2020-12-04 04:45

    One approach is to melt the DataFrame, assign a key that identifies columns in the same "group" (in this case H<some digits>, but you can amend that as required), then group by person and that key and keep only the groups containing at least one non-NA value, e.g.:

    Starting with:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'person_id':[1],'H1_date' : ['2006-10-30 00:00:00'], 'H1':[2.3],'H2_date' : ['2016-10-30 00:00:00'], 'H2':[12.3],'H3_date' : ['2026-11-30 00:00:00'], 'H3':[22.3],'H4_date' : ['2106-10-30 00:00:00'], 'H4':[42.3],'H5_date' : [np.nan], 'H5':[np.nan],'H6_date' : ['2006-10-30 00:00:00'], 'H6':[2.3],'H7_date' : [np.nan], 'H7':[2.3],'H8_date' : ['2006-10-30 00:00:00'], 'H8':[np.nan]})
    

    Use:

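    df2 = (
        df.melt(id_vars='person_id')
        # group key: the digits after "H", shared by "H5" and "H5_date"
        .assign(_gid=lambda v: v['variable'].str.extract(r'H(\d+)', expand=False))
        .groupby(['person_id', '_gid'])
        # keep a group only if it has at least one non-NA value
        .filter(lambda g: bool(g['value'].notna().any()))
        .drop(columns='_gid')
    )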
    

    Which gives you:

        person_id variable                value
    0           1  H1_date  2006-10-30 00:00:00
    1           1       H1                  2.3
    2           1  H2_date  2016-10-30 00:00:00
    3           1       H2                 12.3
    4           1  H3_date  2026-11-30 00:00:00
    5           1       H3                 22.3
    6           1  H4_date  2106-10-30 00:00:00
    7           1       H4                 42.3
    10          1  H6_date  2006-10-30 00:00:00
    11          1       H6                  2.3
    12          1  H7_date                  NaN
    13          1       H7                  2.3
    14          1  H8_date  2006-10-30 00:00:00
    15          1       H8                  NaN
    

    You can then use that as a starting point to tweak if necessary.
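The same melt, key, and filter steps can also be sketched as a boolean mask, which makes it easier to check the behaviour with more than one person. This is only a sketch: the helper name `melt_keep_nonempty`, its parameters, and the two-person sample data are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd


def melt_keep_nonempty(wide, id_col="person_id", pattern=r"H(\d+)"):
    """Melt `wide` and drop each (id, group) pair whose values are all NA."""
    long = wide.melt(id_vars=id_col)
    # Group key: the digits after "H", shared by "H5" and "H5_date".
    gid = long["variable"].str.extract(pattern, expand=False)
    # True on every row of a group that has at least one non-NA value.
    keep = long["value"].notna().groupby([long[id_col], gid]).transform("any")
    return long[keep]


# Hypothetical sample: two persons, three date/value pairs each.
wide = pd.DataFrame({
    "person_id": [1, 2],
    "H1_date": ["2006-10-30 00:00:00", np.nan],
    "H1": [2.3, np.nan],
    "H2_date": [np.nan, "2016-10-30 00:00:00"],
    "H2": [np.nan, 12.3],
    "H3_date": [np.nan, np.nan],
    "H3": [2.3, np.nan],
})

result = melt_keep_nonempty(wide)
print(result)
```

For person 1 this keeps the H1 and H3 pairs (H3 has a value but no date) and drops H2. For person 2 only the H2 pair survives.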

  • 2020-12-04 04:52

    You can use:

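    # Columns that hold dates, e.g. "H1_date"
    col = [x for x in df.columns if "date" in x]
    for column in col:
        # column[:-5] strips the "_date" suffix to get the matching value column
        df.dropna(subset=[column, column[:-5]], how='all', inplace=True)

    subset restricts the NA check to the listed columns, how='all' drops a row only when both of those columns are NA, and inplace=True modifies the current DataFrame instead of returning a copy. Note that dropna removes whole rows, so a person is dropped entirely as soon as one date/value pair is NA in both columns.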
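To make the effect of `subset` and `how` concrete, here is a small sketch on a made-up three-row frame (the data is an assumption for illustration, not the asker's). It compares `how="all"` with the default `how="any"` on one date/value pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data: person 1 has both fields, person 2 has neither,
# person 3 has a value but no date.
wide = pd.DataFrame({
    "person_id": [1, 2, 3],
    "H1_date": ["2006-10-30 00:00:00", np.nan, np.nan],
    "H1": [2.3, np.nan, 4.1],
})

# how="all": drop a row only if every column listed in `subset` is NA.
kept_all = wide.dropna(subset=["H1_date", "H1"], how="all")
# how="any" (the default): drop a row if any listed column is NA.
kept_any = wide.dropna(subset=["H1_date", "H1"], how="any")

print(kept_all["person_id"].tolist())
print(kept_any["person_id"].tolist())
```

With `how="all"`, persons 1 and 3 remain. With `how="any"`, only person 1 does.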

  • 2020-12-04 04:59

    Try pd.DataFrame.melt:

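    # Long format: one row per (person_id, original column)
    df = pd.melt(df2, id_vars='person_id', var_name='col', value_name='dates')
    # "H1_date" and "H1" share the prefix "H1"
    df['col2'] = df['col'].str.split("_").str[0]
    # Number of non-NA values per person and prefix
    df['count'] = df.groupby(['person_id', 'col2'])['dates'].transform('count')
    df = df[df['count'] != 0]
    df = df.drop(['col2', 'count'], axis=1)
    print(df)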
    
        person_id      col                dates
    0           1  H1_date  2006-10-30 00:00:00
    1           1       H1                  2.3
    2           1  H2_date  2016-10-30 00:00:00
    3           1       H2                 12.3
    4           1  H3_date  2026-11-30 00:00:00
    5           1       H3                 22.3
    6           1  H4_date  2106-10-30 00:00:00
    7           1       H4                 42.3
    10          1  H6_date  2006-10-30 00:00:00
    11          1       H6                  2.3
    12          1  H7_date                  NaN
    13          1       H7                  2.3
    14          1  H8_date  2006-10-30 00:00:00
    15          1       H8                  NaN
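The filter above relies on `count` ignoring NA values, so a group whose count is 0 is exactly a date/value pair where both entries are missing. A minimal sketch of that step on its own, using a made-up long-format frame (not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format rows for three date/value pairs.
long = pd.DataFrame({
    "col": ["H1_date", "H1", "H5_date", "H5", "H7_date", "H7"],
    "dates": ["2006-10-30 00:00:00", 2.3, np.nan, np.nan, np.nan, 2.3],
})

# "H1_date" and "H1" both map to the prefix "H1".
prefix = long["col"].str.split("_").str[0]
# Non-NA count per prefix, broadcast back onto every row of the group.
non_na = long.groupby(prefix)["dates"].transform("count")

print(non_na.tolist())
```

Here the H5 rows get a count of 0 and would be filtered out, while H7 gets 1 and is kept even though its date is missing.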
    
    