machine learning - how to use the past 20 rows as an input for X for each Y value

Submitted by 社会主义新天地 on 2019-11-29 04:49:43

This is typically done with recurrent neural networks (RNNs), which retain some memory of the previous input when the next input is received. That's a very brief explanation of what goes on, but there are plenty of sources on the internet to help you better understand how they work.

Let's break this down with a simple example. Say you have 5 samples and 5 features of data, and you want to stagger the data by 2 rows instead of 20. Here is your data (assuming 1 stock, with the oldest price value first), and we can think of each row as a day of the week:

import numpy as np

ar = np.random.randint(10, 100, (5, 5))

[[43, 79, 67, 20, 13],    #<---Monday---
 [80, 86, 78, 76, 71],    #<---Tuesday---
 [35, 23, 62, 31, 59],    #<---Wednesday---
 [67, 53, 92, 80, 15],    #<---Thursday---
 [60, 20, 10, 45, 47]]    #<---Friday---

To use an LSTM in Keras, your data needs to be 3-D rather than the 2-D structure it has now, and the notation for each dimension is (samples, timesteps, features). Currently you only have (samples, features), so you need to augment the data.

a2 = np.concatenate([ar[x:x+2, :] for x in range(ar.shape[0] - 1)])
a2 = a2.reshape(4, 2, 5)    # (samples, timesteps, features)

[[[43, 79, 67, 20, 13],    #See Monday First
  [80, 86, 78, 76, 71]],   #See Tuesday second ---> Predict Value originally set for Tuesday
 [[80, 86, 78, 76, 71],    #See Tuesday First
  [35, 23, 62, 31, 59]],   #See Wednesday Second ---> Predict Value originally set for Wednesday
 [[35, 23, 62, 31, 59],    #See Wednesday Value First
  [67, 53, 92, 80, 15]],   #See Thursday Values Second ---> Predict value originally set for Thursday
 [[67, 53, 92, 80, 15],    #And so on
  [60, 20, 10, 45, 47]]]

Notice how the data is staggered and 3-dimensional. Now just build an LSTM network. Y remains 2-D since this is a many-to-one structure, but you need to clip its first value so each window lines up with the target for its last row.
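Generalizing the staggering above to the original question's 20-row window, a small helper along these lines (the function name and the example shapes are illustrative, not from the answer) builds the 3-D array and clips `y` to match:

```python
import numpy as np

def make_windows(X, y, timesteps):
    """Stack overlapping windows of `timesteps` rows into a 3-D array.

    Returns X3 with shape (samples - timesteps + 1, timesteps, features)
    and the matching targets: the y value for the last row of each window.
    """
    n = X.shape[0] - timesteps + 1
    X3 = np.stack([X[i:i + timesteps] for i in range(n)])
    return X3, y[timesteps - 1:]

X = np.random.rand(100, 5)   # 100 rows, 5 features
y = np.random.rand(100)      # one target per row
X3, y3 = make_windows(X, y, 20)
print(X3.shape, y3.shape)    # (81, 20, 5) (81,)
```

The first 19 targets are dropped because those rows have no full 20-row history behind them.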

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(hidden_dims, input_shape=(a2.shape[1], a2.shape[2])))
model.add(Dense(1))

This is just a brief example to get you moving. There are many different setups that will work (including ones that don't use RNNs); you need to find the correct one for your data.
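Putting the pieces together, a minimal end-to-end sketch might look like this (the `hidden_dims` value, optimizer, loss, and epoch count are placeholders of mine, not part of the answer):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

a2 = np.random.rand(4, 2, 5)   # (samples, timesteps, features), as built above
y = np.random.rand(4, 1)       # one target per window (first value clipped)

hidden_dims = 16
model = Sequential()
model.add(LSTM(hidden_dims, input_shape=(a2.shape[1], a2.shape[2])))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(a2, y, epochs=2, verbose=0)
print(model.predict(a2, verbose=0).shape)   # (4, 1)
```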

This seems to be a time-series type of task.
I would start by looking at recurrent neural networks in Keras.

If you want to keep using the modelling you have (which I would not recommend for time series), you could transform your dataset so that each row becomes some kind of weighted average of the last 20 observations (rows).
That way, each observation in the new dataset is a function of the previous 20, so that information is present for classification.
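For example, a per-column weighted average of the last 20 rows could be sketched like this (the helper name and the choice of weights are my illustrative assumptions):

```python
import numpy as np

def rolling_weighted_mean(X, N, weights=None):
    """Replace each row with a weighted average of the previous N rows
    (itself included). Uniform weights by default; pass increasing
    weights to emphasise recent rows. Rows without N predecessors
    are dropped."""
    if weights is None:
        weights = np.ones(N)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    out = np.empty((X.shape[0] - N + 1, X.shape[1]))
    for j in range(X.shape[1]):
        # 'valid' keeps only positions with a full window of N rows;
        # the kernel is reversed so weights[-1] hits the newest row
        out[:, j] = np.convolve(X[:, j], weights[::-1], mode='valid')
    return out

X = np.random.rand(200, 5)
print(rolling_weighted_mean(X, 20).shape)   # (181, 5)
```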

You can use something like this for each column if you want the running sum:

import numpy as np

def running_sum(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) 

x=np.random.rand(200)

print(running_sum(x,20))

Alternatively, you could pivot your current dataset so that each row carries the actual numbers: add 19 × (feature count) columns and populate them with the previous observations' data. Whether this is possible or practical depends on the shape of your dataset.
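A sketch of that pivot (the helper name is mine, and I assume rows without a full 20-row history are simply dropped):

```python
import numpy as np

def add_lag_columns(X, lags):
    """Append `lags` shifted copies of every column, so row i also
    carries rows i-1 .. i-lags. Rows lacking a full history are dropped."""
    # k = 0 is the current row; k = 1..lags are progressively older rows
    parts = [X[lags - k : X.shape[0] - k] for k in range(lags + 1)]
    return np.hstack(parts)

X = np.random.rand(100, 5)
X_wide = add_lag_columns(X, 19)   # 19 lags -> 20 observations per row
print(X_wide.shape)               # (81, 100)
```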

This is a simple, not too thorough, way to make sure each observation has the data you think will make a good prediction. You need to be aware of these things:

  1. The modelling method must be 'ok' with observations that are not fully independent of one another.
  2. When you make the prediction for X[i], you have all the information from X[i-20] to X[i-1].

I'm sure there are other considerations that make this approach suboptimal, which is why I suggest using a dedicated RNN.

I am aware that djk already pointed out this is an RNN task; I'm posting this after that answer was accepted, per the OP's request.
