Missing samples of a dataframe in pandas

偶尔善良 提交于 2019-12-25 18:42:38

问题


My df:

In [163]: df.head()
Out[163]: 
                       x-axis    y-axis    z-axis
time   
2017-07-27 06:23:08 -0.107666 -0.068848  0.963623
2017-07-27 06:23:08 -0.105225 -0.070068  0.963867
.....

I set the index as datetime. Since the sampling rate (10 Hz) is not always constant in the dataframe and for some second I have 8 or 9 samples.

  1. I would like to specify the milliseconds on my datatime (06:23:08**.100**, 06:23:08**.200**, etc.)
  2. I also would like to do interpolation of the missing samples.

Some ideas how to do it in pandas?


回答1:


First lets create some sample data which maybe resembles your data.

import pandas as pd
from datetime import timedelta
from datetime import datetime

base = datetime.now()
date_list = [base - timedelta(days=x) for x in range(0, 2)]
values = [v for v in range(2)]
df = pd.DataFrame.from_dict({'Date': date_list, 'values': values})

df = df.set_index('Date')
df

                           values
Date    
2017-08-18 20:42:08.563878  0
2017-08-17 20:42:08.563878  1

Now we will create another data frame with every 100 milliseconds of datapoint.

min_val = df.index.min()
max_val = df.index.max()

all_val = []
while min_val <= max_val:
    all_val.append(min_val)
    min_val += timedelta(milliseconds=100)
# len(all_val) 864001 
df_new = pd.DataFrame.from_dict({'Date': all_val})
df_new = df_new.set_index('Date')

lets join both data frame so all missing rows will have index but no values.

final_df = df_new.join(df)
final_df

                            values
Date    
2017-08-17 20:42:08.563878  1.0
2017-08-17 20:42:08.663878  NaN
2017-08-17 20:42:08.763878  NaN
2017-08-17 20:42:08.863878  NaN
2017-08-17 20:42:08.963878  NaN
2017-08-17 20:42:09.063878  NaN
2017-08-17 20:42:09.163878  NaN

Now interpolate data:

df_final.interpolate()

                            values
Date    
2017-08-17 20:42:08.563878  1.000000
2017-08-17 20:42:08.663878  0.999999
2017-08-17 20:42:08.763878  0.999998
2017-08-17 20:42:08.863878  0.999997
2017-08-17 20:42:08.963878  0.999995
2017-08-17 20:42:09.063878  0.999994
2017-08-17 20:42:09.163878  0.999993
2017-08-17 20:42:09.263878  0.999992

Some interpolation strategies: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html

UPDATE: As per the discussion in comments:

say our initial data does not have millisecond information.

df_new_date_without_miliseconds = df_new['Date']
df_new_date_without_miliseconds[0] # Timestamp('2017-08-17 21:45:49')

max_value_date = df_new_date_without_miliseconds[0]
max_value_miliseconds = df_new_date_without_miliseconds[0]

updated_dates = []
for val in df_new_date_without_miliseconds:
    if val == max_value_date:
        val = max_value_miliseconds + timedelta(milliseconds=100)
        max_value_miliseconds = val
    elif val > max_value_date:
        max_value_date = val + timedelta(milliseconds=0)
        max_value_miliseconds = val
    updated_dates.append(val)

output:

[Timestamp('2017-08-17 21:45:49.100000'),
 Timestamp('2017-08-17 21:45:49.200000'),
 Timestamp('2017-08-17 21:45:49.300000'),
 Timestamp('2017-08-17 21:45:50'),
 Timestamp('2017-08-17 21:45:50.100000'),

Assign the new values to the DataFrame

df_new['Date'] = updated_dates


来源:https://stackoverflow.com/questions/45759515/missing-samples-of-a-dataframe-in-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!