Filling NaN by 'ffill' and 'interpolate' depending on time of the day of NaN occurrence in Python

血红的双手。 Submitted on 2019-12-24 15:42:47

Question


I want to fill NaN values in a df using 'ffill' and 'interpolate', depending on the time of day at which the NaN occurs. As you can see below, the first NaN occurs at 6 am and the second at 8 am.

02/03/2016 05:00    8
02/03/2016 06:00    NaN
02/03/2016 07:00    1
02/03/2016 08:00    NaN
02/03/2016 09:00    3

My df covers thousands of days. I want to apply 'ffill' to any NaN that occurs before 7 am and 'interpolate' to any that occurs after 7 am. My data runs from 6 am to 6 pm.

My attempt is:

df_imputed = (df.between_time("00:00:00", "07:00:00", include_start=True, include_end=False)).ffill()
df_imputed = (df.between_time("07:00:00", "18:00:00", include_start=True, include_end=True)).interpolate()   

But this cuts my df down to the selected time periods rather than filling the NaN values as I want.

Edit: my df contains around 400 columns, so the procedure should apply to all columns.


Answer 1:


Original question: single series of values

You can define a Boolean series according to your condition, then interpolate or ffill as appropriate via numpy.where:

import numpy as np
import pandas as pd

# setup
df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3]})
df['date'] = pd.to_datetime(df['date'])

# construct Boolean switch series: True where the time of day is after 07:00
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')

# use numpy.where to differentiate between two scenarios
df['value'] = np.where(switch, df['value'].interpolate(), df['value'].ffill())

print(df)

                 date  value
0 2016-02-03 05:00:00    8.0
1 2016-02-03 06:00:00    8.0
2 2016-02-03 07:00:00    1.0
3 2016-02-03 08:00:00    2.0
4 2016-02-03 09:00:00    3.0
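
To see what the switch does, it can help to print it on its own. The following is a small sketch on the same sample data; the variable name time_of_day is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['02/03/2016 05:00', '02/03/2016 06:00',
                                           '02/03/2016 07:00', '02/03/2016 08:00',
                                           '02/03/2016 09:00']),
                   'value': [8, np.nan, 1, np.nan, 3]})

# time elapsed since midnight for each row (illustrative name)
time_of_day = df['date'] - df['date'].dt.normalize()

# True only for rows strictly later than 07:00
switch = time_of_day > pd.to_timedelta('07:00:00')

print(switch.tolist())
```

Because the comparison is strict, the 07:00 row itself is False and falls on the ffill side; using >= instead would move it to the interpolate side. The choice only matters if a NaN sits exactly at 07:00.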

Updated question: multiple series of values

With multiple value columns, you can adjust the above solution using pd.DataFrame.where and iloc. Or, instead of iloc, you can use loc or other means (e.g. filter) of selecting columns:

import numpy as np
import pandas as pd

# setup
df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]})
df['date'] = pd.to_datetime(df['date'])

# construct Boolean switch series: True where the time of day is after 07:00
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')

# use DataFrame.where to differentiate between two scenarios
df.iloc[:, 1:] = df.iloc[:, 1:].interpolate().where(switch, df.iloc[:, 1:].ffill())

print(df)

                 date  value  value2
0 2016-02-03 05:00:00    8.0     3.0
1 2016-02-03 06:00:00    8.0     3.0
2 2016-02-03 07:00:00    1.0     2.0
3 2016-02-03 08:00:00    2.0     4.0
4 2016-02-03 09:00:00    3.0     6.0
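
The question's use of between_time suggests the timestamps may live in a DatetimeIndex rather than in a 'date' column. Below is a minimal sketch under that assumption (it is not part of the original answer): forward-fill the whole frame first, then write the interpolated rows back with loc. The names after_7am and df_imputed are illustrative.

```python
import numpy as np
import pandas as pd

# sample data with the timestamps as the index (assumed layout)
idx = pd.to_datetime(['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                      '02/03/2016 08:00', '02/03/2016 09:00'])
df = pd.DataFrame({'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]}, index=idx)

# Boolean array: True where the time of day is strictly after 07:00
after_7am = (df.index - df.index.normalize()) > pd.Timedelta(hours=7)

# start from the forward-filled frame, then overwrite the later rows
# with their linearly interpolated values (covers every column at once)
df_imputed = df.ffill()
df_imputed.loc[after_7am] = df.interpolate().loc[after_7am]

print(df_imputed)
```

Both ffill and interpolate run over the whole frame here, so a NaN in the first reading of a day would be filled using the previous day's last reading. If that is not wanted, filling within each calendar day, for example with df.groupby(df.index.normalize()).ffill(), may be worth considering.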


Source: https://stackoverflow.com/questions/53697353/filling-nan-by-ffill-and-interpolate-depending-on-time-of-the-day-of-nan-occ