Transform wide to long but with repetition of a specific column

Submitted by 痞子三分冷 on 2019-12-07 23:44:57

Question


I have a dataframe as shown below

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'pid':[1,2,3,4],'BP1Date':['12/11/2016','12/21/2016','12/31/2026',np.nan],'BP1di':[21,24,25,np.nan],'BP1sy':[123,125,127,np.nan],'BP2Date':['12/31/2016','12/31/2016','12/31/2016','12/31/2016'],'BP2di':[21,26,28,30],'BP2sy':[123,130,135,145],
                   'BP3Date':['12/31/2017','12/31/2018','12/31/2019','12/31/2116'],'BP3di':[21,31,36,np.nan],'BP3sy':[123,126,145,np.nan]})

It looks as shown below

I expect my output to look as shown below

This is what I tried, based on suggestions from other SO posts, but I could not get close to the expected output

df = pd.melt(df2, id_vars='pid', var_name='col', value_name='dates')
df['col2'] = [x.split("Date")[0][:3] for x in df['col']]
df = df[df.groupby(['pid','col2'])['dates'].transform('count').ne(0)].copy()
df['col3'] = df['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df.sort_values(by=['pid','col3'])

Please note two things:

a) For each date, I have two readings (BP{n}di, BP{n}sy)

b) I would like to drop a record only when all 3 columns are NA together (in this case, for pid = 4, BP1Date, BP1di and BP1sy are all NA). If any one of the columns is not NA, the NA values should be retained as shown below. That is why I am using pd.melt, as suggested in SO posts, rather than stack(dropna=False)

How can I transform the input to achieve the output shown above in the screenshot?

updated screenshot based on Answer comments


Answer 1:


Use lreshape together with DataFrame.stack to reshape, then remove rows with a missing Date using DataFrame.dropna and sort by the first 3 columns:

a = [col for col in df2.columns if col.endswith('Date')]
b = [col for col in df2.columns if col.endswith('di')]
c = [col for col in df2.columns if col.endswith('sy')]

df1 = (pd.lreshape(df2, {'Date':a, 'di':b, 'sy':c}, dropna=False)
       .set_index(['pid','Date'])
       .stack(dropna=False)
       .rename_axis(['pid','Date','type'])
       .reset_index(name='value')
       .dropna(subset=['Date'])
       # the sample dates are month/day/year (e.g. 12/31/2016), so parse them explicitly
       .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
       .sort_values(['pid','Date','type'])
       .reset_index(drop=True)
       )

print(df1)
    pid       Date type  value
0     1 2016-12-11   di   21.0
1     1 2016-12-11   sy  123.0
2     1 2016-12-31   di   21.0
3     1 2016-12-31   sy  123.0
4     1 2017-12-31   di   21.0
5     1 2017-12-31   sy  123.0
6     2 2016-12-21   di   24.0
7     2 2016-12-21   sy  125.0
8     2 2016-12-31   di   26.0
9     2 2016-12-31   sy  130.0
10    2 2018-12-31   di   31.0
11    2 2018-12-31   sy  126.0
12    3 2016-12-31   di   28.0
13    3 2016-12-31   sy  135.0
14    3 2019-12-31   di   36.0
15    3 2019-12-31   sy  145.0
16    3 2026-12-31   di   25.0
17    3 2026-12-31   sy  127.0
18    4 2016-12-31   di   30.0
19    4 2016-12-31   sy  145.0
20    4 2116-12-31   di    NaN
21    4 2116-12-31   sy    NaN
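
Note that the chain above drops rows on Date alone. That coincides with the rule in the question (drop only when Date, di and sy are all missing) as long as a missing Date always comes with missing readings, which happens to hold for this data. A minimal sketch of the stricter rule follows, assuming a small made-up frame; the names wide, long_df, kept and by_date are illustrative and not from the original post:

```python
import numpy as np
import pandas as pd

# Hypothetical data: pid 1 has a BP2 reading without a date,
# pid 2 has no BP1 measurement at all.
wide = pd.DataFrame({
    'pid': [1, 2],
    'BP1Date': ['12/11/2016', np.nan],
    'BP1di': [21, np.nan],
    'BP1sy': [123, np.nan],
    'BP2Date': [np.nan, '12/31/2016'],
    'BP2di': [22, 30],
    'BP2sy': [np.nan, 145],
})

# One row per (pid, measurement), keeping rows that contain NaN.
long_df = pd.lreshape(
    wide,
    {'Date': ['BP1Date', 'BP2Date'],
     'di': ['BP1di', 'BP2di'],
     'sy': ['BP1sy', 'BP2sy']},
    dropna=False,
)

# Drop a measurement only when Date, di and sy are all missing.
kept = long_df.dropna(subset=['Date', 'di', 'sy'], how='all')

# For comparison: dropping on Date alone also discards pid 1's undated reading.
by_date = long_df.dropna(subset=['Date'])
```

With how='all' the undated reading for pid 1 survives, while the fully empty BP1 row for pid 2 is removed in both variants.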

An alternative solution builds a MultiIndex in the columns with Series.str.extract and MultiIndex.from_tuples:

df2 = df2.set_index('pid')

c = df2.columns.to_frame(name='orig')
c = c['orig'].str.extract('(.+)(Date|di|sy)').apply(tuple, axis=1)

df2.columns = pd.MultiIndex.from_tuples(c)

df1 = (df2.stack(0)
       .set_index(['Date'], append=True)
       .reset_index(level=1, drop=True)
       .stack(dropna=False)
       .rename_axis(['pid','Date','type'])
       .reset_index(name='value')
       .dropna(subset=['Date'])
       # the sample dates are month/day/year (e.g. 12/31/2016), so parse them explicitly
       .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
       .sort_values(['pid','Date','type'])
       .reset_index(drop=True)
       )

print(df1)
    pid       Date type  value
0     1 2016-12-11   di   21.0
1     1 2016-12-11   sy  123.0
2     1 2016-12-31   di   21.0
3     1 2016-12-31   sy  123.0
4     1 2017-12-31   di   21.0
5     1 2017-12-31   sy  123.0
6     2 2016-12-21   di   24.0
7     2 2016-12-21   sy  125.0
8     2 2016-12-31   di   26.0
9     2 2016-12-31   sy  130.0
10    2 2018-12-31   di   31.0
11    2 2018-12-31   sy  126.0
12    3 2016-12-31   di   28.0
13    3 2016-12-31   sy  135.0
14    3 2019-12-31   di   36.0
15    3 2019-12-31   sy  145.0
16    3 2026-12-31   di   25.0
17    3 2026-12-31   sy  127.0
18    4 2016-12-31   di   30.0
19    4 2016-12-31   sy  145.0
20    4 2116-12-31   di    NaN
21    4 2116-12-31   sy    NaN
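
A third route, not part of the original answer, is pd.wide_to_long. It expects column names of the form stub followed by a number, so the sketch below first moves the number to the end (BP1Date becomes Date1). This is only a sketch on a reduced, made-up frame; the names wide, renamed, per_visit and result are illustrative assumptions:

```python
import re

import numpy as np
import pandas as pd

# Reduced, hypothetical version of the question's frame.
wide = pd.DataFrame({
    'pid': [1, 2],
    'BP1Date': ['12/11/2016', np.nan],
    'BP1di': [21, np.nan],
    'BP1sy': [123, np.nan],
    'BP2Date': ['12/31/2016', '12/31/2016'],
    'BP2di': [21, 30],
    'BP2sy': [123, 145],
})

# BP1Date -> Date1, BP2sy -> sy2, ...; 'pid' does not match and is left alone.
renamed = wide.rename(
    columns=lambda c: re.sub(r'^BP(\d+)(Date|di|sy)$', r'\2\1', c)
)

# One row per (pid, measurement number n) with columns Date, di, sy.
per_visit = (
    pd.wide_to_long(renamed, stubnames=['Date', 'di', 'sy'], i='pid', j='n')
    .reset_index()
    # drop a measurement only when all three fields are missing
    .dropna(subset=['Date', 'di', 'sy'], how='all')
)

# Repeat pid and Date for each reading type (di, sy).
result = (
    per_visit.melt(id_vars=['pid', 'Date'], value_vars=['di', 'sy'],
                   var_name='type', value_name='value')
    .assign(Date=lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
    .sort_values(['pid', 'Date', 'type'])
    .reset_index(drop=True)
)
```

The intermediate per_visit frame keeps one row per measurement, which makes the "all three missing" rule a single dropna call before the final melt.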


Source: https://stackoverflow.com/questions/57347377/transform-wide-to-long-but-with-repetition-of-a-specific-column