Question
I am preprocessing a time-series dataset, reshaping it from 2 dimensions (datapoints, features) into 3 dimensions (datapoints, time_window, features).
In this context the time window (sometimes also called the look-back) is the number of previous time steps/datapoints used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into consideration for a single prediction of the future.
The issue with this approach (or at least with my implementation) is that it is quite inefficient in terms of memory usage: it duplicates data across the windows, so the input data becomes very heavy.
This is the function that I have been using so far to reshape the input data into a 3-dimensional structure.
import numpy as np
from sys import getsizeof

def time_framer(data_to_frame, window_size=1):
    """Transform a 2-D dataset into 3-D windows of a given size;
    the original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1])).astype(np.float32)
    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
    print(framed_data.shape)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty quality test to check that the data has been framed correctly
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
It was suggested that I use NumPy's stride tricks to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on the subject focuses on applying the trick to a 2-dimensional array, as in this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I have come up with; however, it neither reduces the size of framed_data nor frames the data correctly, as it does not pass the quality test.
I am quite sure that my error is in the strides parameter, which I have not fully understood. The new_strides are the only values I managed to feed to as_strided successfully.
from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):
    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0] * data_to_frame.shape[1],
                   data_to_frame.strides[0] * window_size)
    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame,
                             shape=(n_datapoints,  # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),
                             strides=new_strides).astype(np.float32)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty test to check that the data has been framed correctly
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
Any help would be highly appreciated!
Answer 1:
For this X:
In [734]: X = np.arange(24).reshape(8,3)
In [735]: X.strides
Out[735]: (24, 8)
this as_strided produces the same array as your time_framer:
In [736]: np.lib.stride_tricks.as_strided(X,
shape=(X.shape[0]-3, 3, X.shape[1]),
strides=(24, 24, 8))
Out[736]:
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]],
[[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[12, 13, 14],
[15, 16, 17],
[18, 19, 20]]])
It strides the last dimension just like X, and the 2nd-to-last as well. The first dimension advances one row at a time, so it too gets X.strides[0]. The window size therefore only affects the shape, not the strides.
So in your as_strided version just use:
new_strides = (data_to_frame.strides[0],
               data_to_frame.strides[0],
               data_to_frame.strides[1])
A minor correction: set the default window size to 2 or larger, since 1 produces an indexing error in the test
framed_data[0,1]==framed_data[1,0]
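Putting the corrected strides together, a working version of the question's function might look like this (a sketch; the function name is mine, and writeable=False is an optional safeguard since the windows share one buffer):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_time_framer(data, window_size=2):
    """Return a 3-D sliding-window view of a 2-D array without copying."""
    n_datapoints = data.shape[0] - window_size
    return as_strided(
        data,
        shape=(n_datapoints, window_size, data.shape[1]),
        # window and row axes both advance one row; columns keep their stride
        strides=(data.strides[0], data.strides[0], data.strides[1]),
        writeable=False,  # the windows overlap in memory, so block accidental writes
    )

X = np.arange(24).reshape(8, 3)
framed = strided_time_framer(X, 3)
print(framed.shape)                          # (5, 3, 3)
print((framed[0, 1] == framed[1, 0]).all())  # True: windows overlap correctly
```

Note that the .astype(np.float32) call from the question must be dropped here, since casting forces exactly the copy the trick is meant to avoid.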
Looking at getsizeof:
In [754]: sys.getsizeof(X)
Out[754]: 112
In [755]: X.nbytes
Out[755]: 192
Wait, why is X's size smaller than its nbytes? Because it is a view (see line [734] above).
In [756]: sys.getsizeof(X.copy())
Out[756]: 304
As noted in another SO answer, getsizeof has to be used with caution:
Why the size of numpy array is different?
Now for the expanded copy:
In [757]: x2=time_framer(X,4)
...
In [758]: x2.strides
Out[758]: (96, 24, 8)
In [759]: x2.nbytes
Out[759]: 384
In [760]: sys.getsizeof(x2)
Out[760]: 512
and the strided version
In [761]: x1=strides_trick_time_framer(X,4)
...
In [762]: x1.strides
Out[762]: (24, 24, 8)
In [763]: sys.getsizeof(x1)
Out[763]: 128
In [764]: x1.astype(int).strides
Out[764]: (96, 24, 8)
In [765]: sys.getsizeof(x1.astype(int))
Out[765]: 512
x1's size is just like a view's (128, because it is 3-D). But if we try to change its dtype, it makes a copy, and the strides and size are the same as x2's.
Many operations on x1 will lose the strided size advantage: x1.ravel(), x1 + 1, etc. It is mainly reduction operations like mean and sum that produce real space savings.
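That distinction can be checked with np.shares_memory. A small sketch using the corrected strides from above:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24).reshape(8, 3)
# corrected strides: the window axis advances one row, just like the row axis
win = as_strided(X, shape=(5, 3, 3),
                 strides=(X.strides[0], X.strides[0], X.strides[1]))
print(np.shares_memory(win, X))      # True: still a view on X's buffer
print(np.shares_memory(win + 1, X))  # False: arithmetic materializes a copy
print(win.sum(axis=1))               # reductions consume the view directly
```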
Answer 2:
You can use the stride template function window_nd I made here.
Then, to stride over just the first dimension, you just need:
framed_data = window_nd(data_to_frame, window_size, axis = 0)
I haven't found a built-in window function yet that can work over arbitrary axes, so unless a new one has been implemented in scipy.signal or skimage recently, that's probably your best bet.
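(Since this answer was written, NumPy 1.20 added numpy.lib.stride_tricks.sliding_window_view, which does slide over a chosen axis. One caveat, shown in this sketch: it appends the window axis last, so a transpose is needed to recover the (datapoints, window, features) layout.)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(24).reshape(8, 3)
# axis=0 slides over rows; the window axis is appended last -> (5, 3, 4)
framed = sliding_window_view(X, 4, axis=0)
framed = framed.transpose(0, 2, 1)  # reorder to (datapoints, window, features)
print(framed.shape)  # (5, 4, 3)
```

Unlike the question's function, this yields shape[0] - window_size + 1 windows, and the result is a read-only view.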
EDIT: To see the memory savings, you will need to use the method described by @ali_m here, since the basic ndarray.nbytes is naive to shared memory.
def find_base_nbytes(obj):
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes
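For example, a sketch comparing the naive per-element count with the size of the single buffer a strided view actually occupies (the concrete byte values assume a float64 array):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def find_base_nbytes(obj):
    # walk the .base chain down to the array that actually owns the buffer
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes

X = np.zeros((8, 3))  # float64: owns an 8 * 3 * 8 = 192-byte buffer
view = as_strided(X, shape=(5, 3, 3),
                  strides=(X.strides[0], X.strides[0], X.strides[1]))
print(view.nbytes)             # 360: naive count over 5 * 3 * 3 elements
print(find_base_nbytes(view))  # 192: the one buffer all windows share
```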
Source: https://stackoverflow.com/questions/52149479/time-series-data-preprocessing-numpy-strides-trick-to-save-memory