Expand dataframe for each date | Pandas

Question

I have a dataframe of user connections where UID represents a user, and date represents the date on which the user made connections (represented by #fans).

UID     Date        #fans
9305    1/25/2015   5
9305    2/26/2015   7
9305    3/27/2015   8
9305    4/1/2015    9
1305    6/6/2015    14
1305    6/26/2015   16
1305    6/27/2015   17

The date range of the dataframe is 01-01-2014 to 12-01-2020.

I need to expand the data so that each user has one row for every date in the date range, and each row's #fans is the user's total connections up to that date. For example, the desired output is:

UID     Date        #fans
9305    1/1/2014    0
9305    1/2/2014    0
9305    1/3/2014    0
...
9305    1/25/2015   5
9305    1/26/2015   5
9305    1/27/2015   5
...
9305    2/26/2015   7
9305    3/27/2015   8
9305    3/28/2015   8
9305    3/29/2015   8
...
9305    4/1/2015    9
...
9305    12/1/2020   9

*for all the UIDs

I am unsure what approach I should take here. Any help is appreciated.

Answer 1:

The code below should give you the desired result.

Step 1: Create a DatetimeIndex covering the date range 01-01-2014 to 12-01-2020.

datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')

Step 2: Get the length of the date range. In our case, it is 2527.

nd = len(datelist)

Step 3: Get the number of unique UIDs in the original dataframe. In the example, there are 2.

nu = len(df['UID'].unique())

Step 4: Create a DataFrame with two columns, UID and Date, holding every UID for every date in the range (2527 dates x 2 UIDs = 5054 rows).

df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd, 'Date':np.repeat(datelist,nu)})

Step 5: Merge the original dataframe into df_final so that rows with a recorded date pick up their #fans value; all other rows get NaN.

df_final = df_final.merge(df, how='left')

Step 6: Now you have all the data you need. Sort it first by UID and then by Date.

df_final.sort_values(['UID','Date'],inplace=True)

Step 7: With the rows in order, forward fill the values so that each recorded count propagates to the rows below it.
Step 8: Replace any remaining NaNs with 0.

df_final['#fans'] = df_final['#fans'].ffill().fillna(0)

Step 9: Finally, change the dtype of #fans to int.

df_final['#fans'] = df_final['#fans'].astype('int64')
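
As a rough check of Steps 1 to 4, the sketch below builds the same UID x Date grid on a deliberately short three-day range. The short range, the toy values, and the name `grid` are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy input with two UIDs, as in the question (values are illustrative).
df = pd.DataFrame({'UID': [9305, 9305, 1305],
                   'Date': pd.to_datetime(['1/2/2014', '1/3/2014', '1/2/2014']),
                   '#fans': [5, 7, 14]})

# A short range (assumption) so the grid is easy to inspect by eye.
datelist = pd.date_range(start='01-01-2014', end='01-03-2014', freq='1d')
uids = df['UID'].unique().tolist()
nd, nu = len(datelist), len(uids)

# uids * nd cycles through the UIDs, while np.repeat holds each date for
# nu consecutive rows, so every (UID, Date) pair appears exactly once.
grid = pd.DataFrame({'UID': uids * nd, 'Date': np.repeat(datelist, nu)})

print(grid.shape)  # (6, 2): nd * nu rows, 2 columns
```

The two columns line up only because one side cycles and the other repeats; cycling or repeating both would pair each UID with the wrong dates.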

Putting all this together, here's the code:

import pandas as pd
import numpy as np

c = ['UID', 'Date', '#fans']
d = [[9305, '1/25/2015', 5],
     [9305, '2/26/2015', 7],
     [9305, '3/27/2015', 8],
     [9305, '4/1/2015', 9],
     [1305, '6/6/2015', 14],
     [1305, '6/26/2015', 16],
     [1305, '6/27/2015', 17]]
df = pd.DataFrame(d, columns=c)
df['Date'] = pd.to_datetime(df['Date'])
print(df)

datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')
nd = len(datelist)
nu = len(df['UID'].unique())

df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd,
                         'Date':np.repeat(datelist,nu)})

df_final = df_final.merge(df, how='left')

df_final.sort_values(['UID','Date'],inplace=True)
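# Forward fill, then set the NaNs that remain to 0 (Steps 7 and 8);
# without fillna(0) the int64 conversion below fails on NaN.
df_final['#fans'] = df_final['#fans'].ffill().fillna(0)
df_final['#fans'] = df_final['#fans'].astype('int64')
print(df_final)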

The output of this will be:

       UID       Date  #fans
1     1305 2014-01-01      0
3     1305 2014-01-02      0
5     1305 2014-01-03      0
7     1305 2014-01-04      0
9     1305 2014-01-05      0
...    ...        ...    ...
5044  9305 2020-11-27      9
5046  9305 2020-11-28      9
5048  9305 2020-11-29      9
5050  9305 2020-11-30      9
5052  9305 2020-12-01      9

The code above does not account for the boundary between one UID and the next: the last value of the previous UID is carried over into the first rows of the next UID. As Ferris commented, the forward fill needs to be grouped by UID.

To keep the #fans count within each group, replace Steps 6, 7, and 8 with the groupby below.

# Group by UID and forward fill #fans within each group.
# Values that are still NA afterwards are set to 0.
# Date has no missing values, so only #fans needs filling.

df_final['#fans'] = df_final.groupby('UID')['#fans'].ffill().fillna(0)

The above code will ensure the following:

       UID       Date  #fans
2526  1305 2020-12-01     17
       UID       Date  #fans
2527  9305 2014-01-01      0

The previous code would have set row 2527 (9305, 2014-01-01) to 17, because a plain forward fill runs on into the next group. The groupby version prevents this.
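
The difference is easy to see on a tiny frame. The following is a minimal sketch with made-up values (the frame `toy` and the names `plain` and `grouped` are illustrative, not from the answer), comparing a plain forward fill with a forward fill grouped by UID.

```python
import numpy as np
import pandas as pd

# Toy frame, already sorted by UID and then Date (illustrative values).
toy = pd.DataFrame({
    'UID':   [1305, 1305, 1305, 9305, 9305, 9305],
    'Date':  pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03'] * 2),
    '#fans': [np.nan, 14, 17, np.nan, np.nan, 5],
})

# Plain ffill ignores the UID boundary: 17 leaks from 1305 into 9305.
plain = toy['#fans'].ffill().fillna(0).astype('int64')

# Grouped ffill restarts at each UID, so the leading gaps of 9305 stay NaN
# and are then set to 0.
grouped = toy.groupby('UID')['#fans'].ffill().fillna(0).astype('int64')

print(plain.tolist())    # [0, 14, 17, 17, 17, 5]
print(grouped.tolist())  # [0, 14, 17, 0, 0, 5]
```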

# Convert #fans to int; otherwise the values stay float and display as xx.0

df_final['#fans'] = df_final['#fans'].astype('int64')
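
For reference, the corrected steps can also be wrapped into a single function. The sketch below is one possible arrangement, not the answer's own code: it swaps the repeat-and-merge construction for `pd.MultiIndex.from_product` plus `reindex`, the helper name `expand_fans` is hypothetical, and it assumes each (UID, Date) pair occurs at most once in the input.

```python
import pandas as pd

def expand_fans(df, start, end):
    """Hypothetical helper: one row per (UID, Date), #fans carried forward per UID."""
    dates = pd.date_range(start=start, end=end, freq='D')
    # Cartesian product of UIDs and dates, used as the target index.
    full = pd.MultiIndex.from_product([df['UID'].unique(), dates],
                                      names=['UID', 'Date'])
    # Assumes (UID, Date) pairs are unique; reindex raises on duplicates.
    out = df.set_index(['UID', 'Date']).reindex(full).reset_index()
    # Forward fill inside each UID only, then zero-fill the dates that
    # come before that UID's first record.
    out['#fans'] = out.groupby('UID')['#fans'].ffill().fillna(0).astype('int64')
    return out

# Illustrative subset of the question's data.
df = pd.DataFrame({'UID': [9305, 9305, 1305],
                   'Date': pd.to_datetime(['1/25/2015', '2/26/2015', '6/6/2015']),
                   '#fans': [5, 7, 14]})

result = expand_fans(df, '01-01-2014', '12-01-2020')
print(result.shape)  # (5054, 3): 2 UIDs x 2527 dates
```

Because the product index is built UID by UID, the rows come out grouped by UID with dates ascending, so no separate sort is needed before the grouped forward fill.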

Source: https://stackoverflow.com/questions/66039611/expand-dataframe-for-each-date-pandas

Tags

pandas

dataframe