Question
I want to fill NaN values in a df using 'ffill' and 'interpolate', depending on the time of day at which the NaN occurs. As you can see below, the first NaN occurs at 6 am and the second at 8 am.
02/03/2016 05:00 8
02/03/2016 06:00 NaN
02/03/2016 07:00 1
02/03/2016 08:00 NaN
02/03/2016 09:00 3
My df covers thousands of days. I want to apply 'ffill' to any NaN that occurs before 7 am and 'interpolate' to those that occur after 7 am. My data runs from 6 am to 6 pm.
My attempt is:
df_imputed = (df.between_time("00:00:00", "07:00:00", include_start=True, include_end=False)).ffill()
df_imputed = (df.between_time("07:00:00", "18:00:00", include_start=True, include_end=True)).interpolate()
But this cuts my df down to the given time periods rather than filling the NaN values as I want.
Edit: my df contains around 400 columns, so the procedure must apply to all columns.
Answer 1:
Original question: single series of values
You can define a Boolean series according to your condition, then interpolate or ffill as appropriate via numpy.where:
import numpy as np
import pandas as pd

# setup
df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3]})
df['date'] = pd.to_datetime(df['date'])
# construct Boolean switch series: True strictly after 07:00
# (a row at exactly 07:00 is forward-filled; use >= to interpolate it instead)
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')
# use numpy.where to differentiate between two scenarios:
# interpolate where switch is True, forward-fill elsewhere
df['value'] = np.where(switch, df['value'].interpolate(), df['value'].ffill())
print(df)
                 date  value
0 2016-02-03 05:00:00    8.0
1 2016-02-03 06:00:00    8.0
2 2016-02-03 07:00:00    1.0
3 2016-02-03 08:00:00    2.0
4 2016-02-03 09:00:00    3.0
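Since the attempt in the question calls between_time, the timestamps are presumably stored in the index rather than in a column. A minimal sketch of the same idea for that layout follows. The sample data and the names s, late and filled are made up for illustration, and the cutoff here uses >= so that 07:00 itself is interpolated, matching include_start=True in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question, with the timestamps as the index
idx = pd.to_datetime(['2016-02-03 05:00', '2016-02-03 06:00', '2016-02-03 07:00',
                      '2016-02-03 08:00', '2016-02-03 09:00'])
s = pd.Series([8, np.nan, 1, np.nan, 3], index=idx, name='value')

# True from 07:00 onwards (interpolate), False before 07:00 (forward-fill)
late = pd.Series(s.index.hour >= 7, index=s.index)

# Keep the interpolated value where `late` is True, otherwise take the ffill value
filled = s.interpolate().where(late, s.ffill())
print(filled)
```

Both fills are computed on the full series and then combined, so no rows are dropped, unlike with between_time.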
Updated question: multiple series of values
With multiple value columns, you can adjust the above solution using pd.DataFrame.where and iloc. Or, instead of iloc, you can use loc or other means (e.g. filter) of selecting columns:
import numpy as np
import pandas as pd

# setup
df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]})
df['date'] = pd.to_datetime(df['date'])
# construct Boolean switch series: True strictly after 07:00
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')
# use pd.DataFrame.where to differentiate between two scenarios:
# keep interpolated values where switch is True, use forward-filled values elsewhere
df.iloc[:, 1:] = df.iloc[:, 1:].interpolate().where(switch, df.iloc[:, 1:].ffill())
print(df)
                 date  value  value2
0 2016-02-03 05:00:00    8.0     3.0
1 2016-02-03 06:00:00    8.0     3.0
2 2016-02-03 07:00:00    1.0     2.0
3 2016-02-03 08:00:00    2.0     4.0
4 2016-02-03 09:00:00    3.0     6.0
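As a sketch of the filter alternative mentioned above, the value columns can be selected by name instead of by position. This assumes the columns to fill share a common substring ('value' here, which is only an illustration); with 400 real columns you would adapt the pattern or pass an explicit list of labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['02/03/2016 05:00', '02/03/2016 06:00',
                                           '02/03/2016 07:00', '02/03/2016 08:00',
                                           '02/03/2016 09:00']),
                   'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]})

# Boolean switch series: True strictly after 07:00
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')

# Select the columns to fill by label (hypothetical naming pattern)
cols = df.filter(like='value').columns

# Interpolated values where switch is True, forward-filled values elsewhere
df[cols] = df[cols].interpolate().where(switch, df[cols].ffill())
print(df)
```

Selecting by label keeps the assignment correct even if the date column is not the first column.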
Source: https://stackoverflow.com/questions/53697353/filling-nan-by-ffill-and-interpolate-depending-on-time-of-the-day-of-nan-occ