Question
I have a dataframe of user connections where UID represents a user, and date represents the date on which the user made connections (represented by #fans).
UID Date #fans
9305 1/25/2015 5
9305 2/26/2015 7
9305 3/27/2015 8
9305 4/1/2015 9
1305 6/6/2015 14
1305 6/26/2015 16
1305 6/27/2015 17
The date range of the dataframe is 01-01-2014 to 12-01-2020.
I need to expand the data so that, for each user, there is one row for every date in the date range, and #fans on each date is the user's total connections up to that date. For example, the desired output is:
UID Date #fans
9305 1/1/2014 0
9305 1/2/2014 0
9305 1/3/2014 0
...
9305 1/25/2015 5
9305 1/26/2015 5
9305 1/27/2015 5
...
9305 2/26/2015 7
9305 3/27/2015 8
9305 3/28/2015 8
9305 3/29/2015 8
...
9305 4/1/2015 9
...
9305 12/1/2020 9
(and likewise for all the other UIDs)
I am unsure what approach I should take here. Any help is appreciated.
Answer 1:
The below code should give you the desired results.
Step 1: Create a daily date range (a pandas DatetimeIndex) from 01-01-2014 to 12-01-2020.
datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')
Step 2: Get the length of the date range. In our case, it is 2527.
nd = len(datelist)
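As a quick sanity check on the figure 2527, the count can be reproduced with the standard library alone. This is only an illustrative sketch; the variable names are not from the original answer.

```python
from datetime import date

start = date(2014, 1, 1)
end = date(2020, 12, 1)

# Both endpoints are part of the range, hence the + 1.
n_days = (end - start).days + 1
print(n_days)  # 2527
```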
Step 3: Get the number of unique UIDs in the original dataframe. In the example, there are 2.
nu = len(df['UID'].unique())
Step 4: Create a DataFrame with two columns, UID and Date, containing every UID for every date in the range (2527 dates x 2 UIDs = 5054 rows).
df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd, 'Date':np.repeat(datelist,nu)})
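The same UID x Date grid can also be built as a Cartesian product with pd.MultiIndex.from_product. This is an alternative to the list-multiplication approach, not the method used in this answer, and the names uids, grid, and df_grid are made up for the sketch.

```python
import pandas as pd

uids = [9305, 1305]
datelist = pd.date_range(start="2014-01-01", end="2020-12-01", freq="D")

# Cartesian product of UIDs and dates: one row per (UID, Date) pair.
grid = pd.MultiIndex.from_product([uids, datelist], names=["UID", "Date"])
df_grid = grid.to_frame(index=False)

print(df_grid.shape)  # (5054, 2)
```

A side effect of this construction is that the rows come out grouped by UID, with dates ascending within each UID.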
Step 5: Now merge the original dataframe into df_final so that each (UID, Date) pair present in the original data gets its #fans value. All other rows get NaN.
df_final = df_final.merge(df, how='left')
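To see what the left merge produces, here is a small sketch using one row from the question (9305, 1/25/2015, 5) and a three-day window around it. The window and the variable names are chosen only for illustration.

```python
import pandas as pd

# One known record taken from the question's data.
df = pd.DataFrame({
    "UID": [9305],
    "Date": pd.to_datetime(["1/25/2015"]),
    "#fans": [5],
})

# A three-day grid around that record.
dates = pd.date_range("2015-01-24", "2015-01-26", freq="D")
grid = pd.DataFrame({"UID": [9305] * len(dates), "Date": dates})

# Left merge keeps every grid row; dates without a record get NaN.
merged = grid.merge(df, how="left", on=["UID", "Date"])
print(merged)
```

Only the middle row carries a value. The other two are NaN, which is why the later steps forward fill and then fill the rest with 0.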
Step 6: Now you have all the data you need. Sort the data first by UID and then by Date
df_final.sort_values(['UID','Date'],inplace=True)
Step 7: Now that the data is in order, forward fill the values so that each known #fans value propagates to the rows below it until the next known value.
Step 8: Next, replace the remaining NaNs (rows that come before the first known value) with 0. Steps 7 and 8 are done in one line:
df_final['#fans'] = df_final['#fans'].ffill().fillna(0)
Step 9: Finally, we change the dtype of #fans to int
df_final['#fans'] = df_final['#fans'].astype('int64')
Putting all this together, here's the code:
import pandas as pd
import numpy as np
c = ['UID','Date','#fans']
d = [[9305, '1/25/2015', 5],
[9305, '2/26/2015', 7],
[9305, '3/27/2015', 8],
[9305, '4/1/2015', 9],
[1305, '6/6/2015', 14],
[1305, '6/26/2015', 16],
[1305, '6/27/2015', 17]]
df = pd.DataFrame(d,columns=c)
df.Date = pd.to_datetime(df.Date)
print (df)
datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')
nd = len(datelist)
nu = len(df['UID'].unique())
df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd,
'Date':np.repeat(datelist,nu)})
df_final = df_final.merge(df, how='left')
df_final.sort_values(['UID','Date'],inplace=True)
# forward fill, then set the leading NaNs to 0 (astype('int64') fails on NaN)
df_final['#fans'] = df_final['#fans'].ffill().fillna(0)
df_final['#fans'] = df_final['#fans'].astype('int64')
print (df_final)
The output of this will be:
UID Date #fans
1 1305 2014-01-01 0
3 1305 2014-01-02 0
5 1305 2014-01-03 0
7 1305 2014-01-04 0
9 1305 2014-01-05 0
... ... ... ...
5044 9305 2020-11-27 9
5046 9305 2020-11-28 9
5048 9305 2020-11-29 9
5050 9305 2020-11-30 9
5052 9305 2020-12-01 9
The code above does not account for the boundary between one UID and the next: after sorting, the last value of one UID gets carried over into the first rows of the following UID. As Ferris pointed out in a comment, the forward fill has to be done per UID using groupby.
Use the following to make sure the #fans count stays within each group. Replace Steps 6, 7, and 8 with the groupby below.
#group by UID and forward fill #fans within each group
#any value that is still NA (dates before the user's first record) is set to 0
#Date has no missing values, so only #fans needs filling
df_final['#fans'] = df_final.groupby('UID')['#fans'].ffill().fillna(0)
The above code will ensure the following:
UID Date #fans
2526 1305 2020-12-01 17
UID Date #fans
2527 9305 2014-01-01 0
The previous code would have filled row 2527 (9305, 2014-01-01) with 17, because a plain forward fill carries the last value of UID 1305 into the next group. The grouped version prevents this.
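The difference between the two fills is easy to see on a toy frame. The five rows below are invented for illustration (they only reuse the values 17 and 5 from the question) and are assumed to be already sorted by UID.

```python
import numpy as np
import pandas as pd

# Toy data: NaN marks dates with no recorded value.
toy = pd.DataFrame({
    "UID": [1305, 1305, 9305, 9305, 9305],
    "#fans": [np.nan, 17, np.nan, 5, np.nan],
})

# Plain forward fill: 17 leaks from UID 1305 into the first row of UID 9305.
plain = toy["#fans"].ffill().fillna(0)

# Grouped forward fill: each UID starts fresh, so that row becomes 0.
grouped = toy.groupby("UID")["#fans"].ffill().fillna(0)

print(plain.tolist())    # [0.0, 17.0, 17.0, 5.0, 5.0]
print(grouped.tolist())  # [0.0, 17.0, 0.0, 5.0, 5.0]
```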
#We need to convert #fans to int, so please make sure you use this
#otherwise values will be in float with xx.0
df_final['#fans'] = df_final['#fans'].astype('int64')
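Putting the grouped version together end to end might look like the sketch below. It follows the steps of this answer, with two choices that are assumptions rather than part of the original: np.tile and DatetimeIndex.repeat are used to build the grid, and the index is reset after sorting so that row labels match row positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[9305, "1/25/2015", 5],
     [9305, "2/26/2015", 7],
     [9305, "3/27/2015", 8],
     [9305, "4/1/2015", 9],
     [1305, "6/6/2015", 14],
     [1305, "6/26/2015", 16],
     [1305, "6/27/2015", 17]],
    columns=["UID", "Date", "#fans"],
)
df["Date"] = pd.to_datetime(df["Date"])

datelist = pd.date_range(start="2014-01-01", end="2020-12-01", freq="D")
uids = df["UID"].unique()

# One row per (UID, Date) pair.
df_final = pd.DataFrame({
    "UID": np.tile(uids, len(datelist)),
    "Date": datelist.repeat(len(uids)),
})

# Attach known values, then order by user and date.
df_final = (
    df_final.merge(df, how="left", on=["UID", "Date"])
    .sort_values(["UID", "Date"])
    .reset_index(drop=True)
)

# Forward fill within each UID; dates before the first record become 0.
df_final["#fans"] = (
    df_final.groupby("UID")["#fans"].ffill().fillna(0).astype("int64")
)

print(df_final.loc[[2526, 2527]])
```

With the index reset, row 2526 is the last date for UID 1305 and row 2527 is the first date for UID 9305, so printing those two rows is a direct check that nothing leaks across the boundary.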
Source: https://stackoverflow.com/questions/66039611/expand-dataframe-for-each-date-pandas