I have a dataframe like the one shown below:
df2 = pd.DataFrame({'person_id':[1],'H1_date' : ['2006-10-30 00:00:00'], 'H1':[2.3],'H2_date' : ['2016-10-30
One approach is to melt the DataFrame, assign a key that identifies columns in the same "group" (in this case H<some digits>, but you can amend that as required), then group by person and that key and keep only the groups that contain at least one non-NA value, e.g.:
Starting with:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person_id': [1],
    'H1_date': ['2006-10-30 00:00:00'], 'H1': [2.3],
    'H2_date': ['2016-10-30 00:00:00'], 'H2': [12.3],
    'H3_date': ['2026-11-30 00:00:00'], 'H3': [22.3],
    'H4_date': ['2106-10-30 00:00:00'], 'H4': [42.3],
    'H5_date': [np.nan], 'H5': [np.nan],
    'H6_date': ['2006-10-30 00:00:00'], 'H6': [2.3],
    'H7_date': [np.nan], 'H7': [2.3],
    'H8_date': ['2006-10-30 00:00:00'], 'H8': [np.nan],
})
Use:
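df2 = (
    df.melt(id_vars='person_id')
    # Group key: the digits after "H", shared by H<n> and H<n>_date
    .assign(_gid=lambda v: v['variable'].str.extract(r'H(\d+)', expand=False))
    .groupby(['person_id', '_gid'])
    # Keep a group only if at least one of its values is non-NA
    .filter(lambda g: bool(g['value'].notna().any()))
    .drop(columns='_gid')
)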
Which gives you:
    person_id  variable                value
0           1   H1_date  2006-10-30 00:00:00
1           1        H1                  2.3
2           1   H2_date  2016-10-30 00:00:00
3           1        H2                 12.3
4           1   H3_date  2026-11-30 00:00:00
5           1        H3                 22.3
6           1   H4_date  2106-10-30 00:00:00
7           1        H4                 42.3
10          1   H6_date  2006-10-30 00:00:00
11          1        H6                  2.3
12          1   H7_date                  NaN
13          1        H7                  2.3
14          1   H8_date  2006-10-30 00:00:00
15          1        H8                  NaN
You can then use that as a starting point to tweak if necessary.
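One possible tweak is to reshape the filtered long frame so that each person and group becomes one row, with the date and the reading side by side. This is a minimal sketch, assuming the same H<n> / H<n>_date naming. The `wide` data, the `field` column and the `reading` label are made up for this illustration:

```python
import numpy as np
import pandas as pd

# Illustrative wide frame in the same layout (not the full data from the question).
wide = pd.DataFrame({
    'person_id': [1],
    'H1_date': ['2006-10-30 00:00:00'], 'H1': [2.3],
    'H5_date': [np.nan], 'H5': [np.nan],
    'H7_date': [np.nan], 'H7': [2.3],
})

long_df = wide.melt(id_vars='person_id')
# Group key: the digits after "H"; expand=False returns a Series.
long_df['_gid'] = long_df['variable'].str.extract(r'H(\d+)', expand=False)
# Label each row as the date or the reading of its group (hypothetical labels).
long_df['field'] = np.where(
    long_df['variable'].str.endswith('_date'), 'date', 'reading'
)

# Keep only groups with at least one non-NA value (H5 is dropped here).
kept = long_df.groupby(['person_id', '_gid']).filter(
    lambda g: bool(g['value'].notna().any())
)

# One row per person and group, with 'date' and 'reading' as columns.
paired = (
    kept.set_index(['person_id', '_gid', 'field'])['value']
        .unstack('field')
        .reset_index()
)
print(paired)
```

Whether this layout is preferable to the long one depends on what you do next; the long frame is easier to filter, the paired one is easier to read.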
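You can use:
col = [x for x in df.columns if x.endswith('_date')]
for column in col:
    # column[:-5] strips the "_date" suffix, e.g. 'H1_date' -> 'H1'
    df.dropna(subset=[column, column[:-5]], how='all', inplace=True)
subset restricts the NA check to the listed columns, how specifies the condition on those columns (here both of them must be NA for the row to be dropped), and inplace modifies the current DataFrame. Note that this drops whole rows rather than columns: a row is removed as soon as any one date/value pair in it is entirely NA.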
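To see what `subset` and `how='all'` do on their own, here is a small sketch on a made-up three-person frame with a single H1 pair (the `demo` data is hypothetical, not taken from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: person 2 has neither a date nor a value for H1.
demo = pd.DataFrame({
    'person_id': [1, 2, 3],
    'H1_date': ['2006-10-30 00:00:00', np.nan, np.nan],
    'H1': [2.3, np.nan, 4.1],
})

# A row is dropped only when BOTH H1_date and H1 are NA in that row.
result = demo.dropna(subset=['H1_date', 'H1'], how='all')
print(result['person_id'].tolist())  # [1, 3]
```

Person 3 is kept because only one of the two columns is missing; with `how='any'` that row would be dropped as well.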
Try pd.DataFrame.melt:
df = pd.melt(df2, id_vars='person_id', var_name='col', value_name='dates')
# Group key: 'H1_date' and 'H1' both map to 'H1'
df['col2'] = df['col'].str.split('_').str[0]
# Number of non-NA values per person and group, broadcast back to every row
df['count'] = df.groupby(['person_id', 'col2'])['dates'].transform('count')
df = df[df['count'] != 0]
df = df.drop(columns=['col2', 'count'])
print(df)
    person_id      col                dates
0           1  H1_date  2006-10-30 00:00:00
1           1       H1                  2.3
2           1  H2_date  2016-10-30 00:00:00
3           1       H2                 12.3
4           1  H3_date  2026-11-30 00:00:00
5           1       H3                 22.3
6           1  H4_date  2106-10-30 00:00:00
7           1       H4                 42.3
10          1  H6_date  2006-10-30 00:00:00
11          1       H6                  2.3
12          1  H7_date                  NaN
13          1       H7                  2.3
14          1  H8_date  2006-10-30 00:00:00
15          1       H8                  NaN
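The filtering step above relies on the group-wise count being broadcast back to every row. A minimal sketch of just that step, on a made-up long frame (the `long_df` data is illustrative; the column names mirror the ones used above):

```python
import numpy as np
import pandas as pd

# Illustrative long frame: H5 has no non-NA values, H7 has exactly one.
long_df = pd.DataFrame({
    'col2': ['H1', 'H1', 'H5', 'H5', 'H7', 'H7'],
    'dates': ['2006-10-30 00:00:00', 2.3, np.nan, np.nan, np.nan, 2.3],
})

# transform('count') returns one value per row: the number of
# non-NA entries in that row's group.
counts = long_df.groupby('col2')['dates'].transform('count')
print(counts.tolist())  # [2, 2, 0, 0, 1, 1]

# Rows whose group count is 0 belong to an all-NA group and are removed.
kept = long_df[counts != 0]
print(kept['col2'].unique().tolist())  # ['H1', 'H7']
```

Because the counts line up with the original rows, they can be used directly as a boolean mask without adding a helper column.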