Question
The purpose of this question is to find out how many trades happened in each second (count) as well as the total volume traded (sum).
I have time series data that cannot be used as an index (there are multiple entries with the same timestamp, since many trades can occur in the same millisecond), and therefore resample as explained here does not work.
Another approach was to first group by time as shown here (and later resample per second). The problem is that grouping applies only one aggregation to all the grouped columns (I can only sum/mean/std etc.), while in this data I need the 'tradeVolume' column to be aggregated by sum and the 'ask1' column by mean.
So my questions are:
1. How can I group by with a different aggregation per column?
2. If that is not possible, is there another way to resample the millisecond data into seconds without a datetime index?
Thanks!
The time series (sample) is here:
SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:09.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:09.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:09.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:09.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:09.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:09.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0
Answer 1:
First you need a column that holds the timestamp at one-second precision, then groupby on that column, and then run an aggregation over the columns you want. In other words, floor each timestamp down to one second, group on the floored value, and apply whatever aggregation (mean/sum/std) each column needs:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
# casting to second precision floors every timestamp to its second
df['dateTime'] = df['dateTime'].astype('datetime64[s]')
groups = df.groupby('dateTime')
groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
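Incidentally, since pandas 0.19 resample can also work directly on a column via the on= keyword, so no DatetimeIndex (unique or otherwise) is needed at all. A minimal sketch of that alternative (not from the original answer):

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['dateTime'])
# resample on the 'dateTime' column rather than the index;
# duplicate timestamps within the same second are simply aggregated together
out = df.resample('1S', on='dateTime').agg({'ask1': 'mean', 'tradeVolume': 'sum'})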
I modified the data to make sure there are actually different seconds in it:
SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:10.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:10.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0
and the output:
In [53]: groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
Out[53]:
                        ask1  tradeVolume
dateTime
2017-11-20 08:00:09  12869.0           10
2017-11-20 08:00:10  12869.0           10
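The question also asked for the number of trades per second; that is just one more aggregation on the same grouping. A sketch, assuming a trade is a row with isTrade == 1 as in the sample data:

# keep only actual trade rows, then count trades and sum volume per second
trades = df[df['isTrade'] == 1]
per_second = trades.groupby('dateTime')['tradeVolume'].agg(['count', 'sum'])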
Footnote: the OP said that the original version (test3 below) was faster, so I ran some timings.
import numpy as np
import pandas as pd

def test1(df):
    """This is the fastest and cleanest."""
    df['dateTime'] = df['dateTime'].astype('datetime64[s]')
    groups = df.groupby('dateTime')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

def test2(df):
    """Totally unnecessary amount of datetime floors."""
    def group_by_second(index_loc):
        # floors one timestamp per row -- this is what makes it slow
        return df.loc[index_loc, 'dateTime'].floor('S')
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    groups = df.groupby(group_by_second)
    result = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

def test3(df):
    """Original version, but the conversion to/from nanoseconds is unnecessary."""
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    df['seconds'] = df['dateTime'].apply(lambda v: v.value // 1e9)
    groups = df.groupby('seconds')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

if __name__ == '__main__':
    import timeit

    print('22 rows')
    df = pd.read_csv('data_small.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

    print('220 rows')
    df = pd.read_csv('data.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))
I tested these on two datasets, the second ten times the size of the first. The results:
22 rows
test1 [0.08138518501073122, 0.07786444900557399, 0.0775048139039427]
test2 [0.2644687460269779, 0.26298125297762454, 0.2618108610622585]
test3 [0.10624988097697496, 0.1028324980288744, 0.10304366517812014]
220 rows
test1 [0.07999306707642972, 0.07842653687112033, 0.07848454895429313]
test2 [1.9794962559826672, 1.966513831866905, 1.9625889619346708]
test3 [0.12691736104898155, 0.12642419710755348, 0.126510804053396]
So it is best to use the .astype('datetime64[s]') version, as it is the fastest and scales the best.
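As an aside, the reason test2 is so slow is that it floors one timestamp per row in Python. The vectorized floor via the .dt accessor avoids that; a sketch of that middle-ground variant (not part of the timings above):

# floor the whole column at once instead of once per row
df['dateTime'] = pd.to_datetime(df['dateTime']).dt.floor('s')
groups = df.groupby('dateTime')
agg = groups.agg({'ask1': 'mean', 'tradeVolume': 'sum'})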
Source: https://stackoverflow.com/questions/48379224/python-re-sampling-time-series-data-which-can-not-be-indexed