Question
I am preprocessing a time-series dataset, reshaping it from 2 dimensions (datapoints, features) into 3 dimensions (datapoints, time_window, features).
In this context the time window (sometimes also called the look-back) is the number of previous time steps/datapoints used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into consideration for a single prediction of the future.
The issue with this approach (or at least with my implementation) is that it is quite inefficient in terms of memory usage: it duplicates data across the windows, so the input data becomes very heavy.
This is the function that I have been using so far to reshape the input data into a 3-dimensional structure.
import numpy as np
from sys import getsizeof

def time_framer(data_to_frame, window_size=1):
    """Transform a 2-D dataset into 3-D windows of a given size;
    the original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1])).astype(np.float32)
    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
    print(framed_data.shape)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty quality test to check that the data has been framed correctly
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
It was suggested that I use NumPy's stride tricks to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on the subject focuses on applying the trick to a 2-dimensional array, as in this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I have come up with; however, it neither reduces the size of framed_data nor frames the data correctly, as it does not pass the quality test.
I am quite sure that my error is in the strides parameter, which I have not fully understood. The new_strides are the only values I managed to feed to as_strided successfully.
from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):
    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0] * data_to_frame.shape[1],
                   data_to_frame.strides[0] * window_size)
    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame,
                             shape=(n_datapoints,  # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),
                             strides=new_strides).astype(np.float32)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty test to check that the data has been framed correctly
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
Any help would be highly appreciated!
Answer 1:
For this X:
In [734]: X = np.arange(24).reshape(8,3)
In [735]: X.strides
Out[735]: (24, 8)
this as_strided produces the same array as your time_framer:
In [736]: np.lib.stride_tricks.as_strided(X,
shape=(X.shape[0]-3, 3, X.shape[1]),
strides=(24, 24, 8))
Out[736]:
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]],
[[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[12, 13, 14],
[15, 16, 17],
[18, 19, 20]]])
It strides the last dimension just like X, and the 2nd-to-last as well. The first dimension advances one row at a time, so it too gets X.strides[0]. The window size therefore only affects the shape, not the strides.
So in your as_strided version just use:
new_strides = (data_to_frame.strides[0],
               data_to_frame.strides[0],
               data_to_frame.strides[1])
A minor correction: set the default window size to 2 or larger, since 1 produces an indexing error in the test
framed_data[0,1]==framed_data[1,0]
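Putting the corrected strides together, a working version of the question's function might look like this (a sketch; the function name is mine, and writeable=False is an optional safeguard since the windows share one buffer):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_time_framer(data, window_size=2):
    """Return a 3-D sliding-window view of a 2-D array without copying."""
    n_datapoints = data.shape[0] - window_size
    return as_strided(
        data,
        shape=(n_datapoints, window_size, data.shape[1]),
        # window and row axes both advance one row; columns keep their stride
        strides=(data.strides[0], data.strides[0], data.strides[1]),
        writeable=False,  # the windows overlap in memory, so block accidental writes
    )

X = np.arange(24).reshape(8, 3)
framed = strided_time_framer(X, 3)
print(framed.shape)                          # (5, 3, 3)
print((framed[0, 1] == framed[1, 0]).all())  # True: windows overlap correctly
```

Note that the .astype(np.float32) call from the question must be dropped here, since casting forces exactly the copy the trick is meant to avoid.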
Looking at getsizeof:
In [754]: sys.getsizeof(X)
Out[754]: 112
In [755]: X.nbytes
Out[755]: 192
Wait, why is X's size smaller than its nbytes? Because it is a view (see line [734] above).
In [756]: sys.getsizeof(X.copy())
Out[756]: 304
As noted in another SO answer, getsizeof has to be used with caution:
Why the size of numpy array is different?
Now for the expanded copy:
In [757]: x2=time_framer(X,4)
...
In [758]: x2.strides
Out[758]: (96, 24, 8)
In [759]: x2.nbytes
Out[759]: 384
In [760]: sys.getsizeof(x2)
Out[760]: 512
and the strided version
In [761]: x1=strides_trick_time_framer(X,4)
...
In [762]: x1.strides
Out[762]: (24, 24, 8)
In [763]: sys.getsizeof(x1)
Out[763]: 128
In [764]: x1.astype(int).strides
Out[764]: (96, 24, 8)
In [765]: sys.getsizeof(x1.astype(int))
Out[765]: 512
x1's size is just like a view's (128, because it is 3-D). But if we try to change its dtype, it makes a copy, and the strides and size are the same as x2's.
Many operations on x1 will lose the strided size advantage: x1.ravel(), x1 + 1, etc. It is mainly reduction operations like mean and sum that produce real space savings.
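That distinction can be checked with np.shares_memory. A small sketch using the corrected strides from above:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24).reshape(8, 3)
# corrected strides: the window axis advances one row, just like the row axis
win = as_strided(X, shape=(5, 3, 3),
                 strides=(X.strides[0], X.strides[0], X.strides[1]))
print(np.shares_memory(win, X))      # True: still a view on X's buffer
print(np.shares_memory(win + 1, X))  # False: arithmetic materializes a copy
print(win.sum(axis=1))               # reductions consume the view directly
```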
Answer 2:
You can use the stride template function window_nd I made here.
Then, to stride over just the first dimension, you just need:
framed_data = window_nd(data_to_frame, window_size, axis = 0)
I haven't found a built-in window function yet that can work over arbitrary axes, so unless a new one has been implemented in scipy.signal or skimage recently, that's probably your best bet.
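(Since this answer was written, NumPy 1.20 added numpy.lib.stride_tricks.sliding_window_view, which does slide over a chosen axis. One caveat, shown in this sketch: it appends the window axis last, so a transpose is needed to recover the (datapoints, window, features) layout.)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(24).reshape(8, 3)
# axis=0 slides over rows; the window axis is appended last -> (5, 3, 4)
framed = sliding_window_view(X, 4, axis=0)
framed = framed.transpose(0, 2, 1)  # reorder to (datapoints, window, features)
print(framed.shape)  # (5, 4, 3)
```

Unlike the question's function, this yields shape[0] - window_size + 1 windows, and the result is a read-only view.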
EDIT: To see the memory savings, you will need to use the method described by @ali_m here, since the basic ndarray.nbytes is naive to shared memory.
def find_base_nbytes(obj):
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes
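For example, a sketch comparing the naive per-element count with the size of the single buffer a strided view actually occupies (the concrete byte values assume a float64 array):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def find_base_nbytes(obj):
    # walk the .base chain down to the array that actually owns the buffer
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes

X = np.zeros((8, 3))  # float64: owns an 8 * 3 * 8 = 192-byte buffer
view = as_strided(X, shape=(5, 3, 3),
                  strides=(X.strides[0], X.strides[0], X.strides[1]))
print(view.nbytes)             # 360: naive count over 5 * 3 * 3 elements
print(find_base_nbytes(view))  # 192: the one buffer all windows share
```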
Source: https://stackoverflow.com/questions/52149479/time-series-data-preprocessing-numpy-strides-trick-to-save-memory