Question
I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks. Due to memory limits, I cannot hold everything in memory after converting the data to sequences for the model.
This leads me to using a generator instead, like the TimeseriesGenerator from Keras/TensorFlow. The problem is that if I use the generator on all of my data stacked together, it creates sequences that mix stocks: with a sequence length of 5, sequence 3 would contain the last 4 observations of "stock 1" and the first observation of "stock 2".
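To make the mixing concrete, here is a minimal sketch with two dummy stocks of 6 observations each (the values are illustrative); the third sequence straddles the stock boundary:

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

stock_1 = np.arange(6).reshape(6, 1)         # observations of "stock 1"
stock_2 = np.arange(100, 106).reshape(6, 1)  # observations of "stock 2"
stacked = np.vstack([stock_1, stock_2])      # naive stacking of all stocks

gen = TimeseriesGenerator(stacked, stacked, length=5, batch_size=1)
x, y = gen[2]  # sequence 3
print(x[0])    # last 4 rows of stock 1 followed by the first row of stock 2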
Instead, what I want is for each sequence to be built from a single stock only, so that no sequence ever spans the boundary between two stocks.
Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one
I explored the option of combining the generators as this SO answer suggests: How do I combine two keras generator functions. However, this is not ideal in the case of ~4000 generators.
I hope my question makes sense.
Answer 1:
So what I've ended up doing is to do all the preprocessing manually and save an .npy file for each stock containing that stock's preprocessed sequences; a manually written generator then draws batches from those files.
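The preprocessing itself isn't shown here, so below is a minimal sketch of that step, assuming each stock starts as a 2-D array of shape (n_observations, n_features); the helper make_sequences and the dummy data are illustrative, not from the original answer:

import numpy as np

def make_sequences(data, n_timesteps):
    # Slide a window of n_timesteps over the rows of a single stock,
    # so no sequence ever crosses a stock boundary.
    return np.stack([data[i:i + n_timesteps]
                     for i in range(len(data) - n_timesteps + 1)])

# Illustrative input: two stocks, each (n_observations, n_features).
per_stock_data = {
    "stock_1": np.random.rand(250, 10),
    "stock_2": np.random.rand(300, 10),
}

for ticker, data in per_stock_data.items():
    np.save(f"{ticker}.npy", make_sequences(data, n_timesteps=5))

With one .npy file of shape (n_sequences, n_timesteps, n_features) per stock on disk, batches are drawn with the generator below: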
import numpy as np
import tensorflow as tf

class seq_generator:
    def __init__(self, list_of_filepaths):
        # Track which sequence indices have already been served per file.
        self.usedDict = dict()
        for path in list_of_filepaths:
            self.usedDict[path] = []

    def generate(self):
        while True:
            # Pick a random stock file, then a random sequence within it.
            path = np.random.choice(list(self.usedDict.keys()))
            stock_array = np.load(path)
            random_sequence = np.random.randint(stock_array.shape[0])
            # Only yield sequences that haven't been served before.
            if random_sequence not in self.usedDict[path]:
                self.usedDict[path].append(random_sequence)
                yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

# Pass the bound method itself (not its return value) to from_generator;
# n_timesteps, n_features and batch_size are defined elsewhere.
train_dataset = tf.data.Dataset.from_generator(
    train_generator.generate,
    output_types=tf.float32,
    output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to the preprocessed .npy files.
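Such a list could be built, for example, with glob (the file pattern is an assumption):

import glob

# Collect every per-stock .npy file produced during preprocessing.
list_of_filepaths = sorted(glob.glob("*.npy"))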
This will:
- Load a random stock's preprocessed .npy data
- Pick a sequence at random
- Check whether the index of that sequence has already been used in usedDict
- If not:
  - Append the index of that sequence to usedDict, to keep track and avoid feeding the same data to the model twice
  - Yield the sequence
This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
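As a quick sanity check (a sketch, assuming the pipeline above has been built), a single batch can be pulled to confirm its shape:

# Pull one batch to confirm the pipeline produces
# (batch_size, n_timesteps, n_features) tensors.
for batch in train_dataset.take(1):
    print(batch.shape)

Note that, as written, the generator yields model inputs only; to pass the dataset straight to model.fit, it would also need to yield a target per sequence, with output_types and output_shapes widened to a pair accordingly.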
Source: https://stackoverflow.com/questions/61177311/creating-a-timeseriesgenerator-with-multiple-inputs