Question
This question concerns the common problem of training in Keras on multiple large files that are jointly too large to fit in GPU memory. I am using Keras 1.0.5 and I would like a solution that does not require 1.0.6. One way to do this was described by fchollet here and here:
import pickle

# Create a generator that yields (current features X, current labels y),
# loading one pickled dataset at a time
def BatchGenerator(files):
    for file in files:
        current_data = pickle.load(open(file, "rb"))
        X_train = current_data[:, :-1]
        y_train = current_data[:, -1]
        yield (X_train, y_train)

# train the model on each dataset in turn
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, nb_epoch=1)
However, I fear that the state of the model is not saved, but rather that the model is reinitialized not only between epochs but also between datasets. Each "Epoch 1/1" below represents training on a different dataset:
~~~~~ Epoch 0 ~~~~~~
Epoch 1/1
295806/295806 [==============================] - 13s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 19s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 19s - loss: 15.8130
~~~~~ Epoch 1 ~~~~~~
Epoch 1/1
295806/295806 [==============================] - 14s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 20s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 15s - loss: 15.8130
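To check whether the weights are really being reset between fit calls, one can snapshot model.get_weights() before a call and compare it with the weights afterwards. A minimal sketch of such a check, using the model and one (X_train, y_train) chunk from the generator above:

import numpy as np

# Snapshot the weights, train on one chunk, and see whether they changed;
# if they did, the state is being updated in place rather than reinitialized.
weights_before = [w.copy() for w in model.get_weights()]
model.fit(X_train, y_train, batch_size=32, nb_epoch=1)
weights_after = model.get_weights()
changed = any(not np.array_equal(b, a) for b, a in zip(weights_before, weights_after))
print("Weights changed after fit:", changed)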
I am aware that one can use model.fit_generator, but as the method above was repeatedly suggested as a way of batch training, I would like to know what I am doing wrong.
Thanks for your help,
Max
Answer 1:
It has been a while since I faced that problem, but I remember that I used Keras' functionality to provide data through Python generators, i.e. model = Sequential(); model.fit_generator(...).
An example code snippet (it should be self-explanatory):
import pickle
import numpy as np
from keras.models import Sequential

# Generator that cycles through the bundle files forever, yielding one
# (X, y) batch at a time so that only one file is in memory at once
def generate_batches(files, batch_size):
    counter = 0
    while True:
        fname = files[counter]
        print(fname)
        counter = (counter + 1) % len(files)
        data_bundle = pickle.load(open(fname, "rb"))
        X_train = data_bundle[0].astype(np.float32)
        y_train = data_bundle[1].astype(np.float32)
        y_train = y_train.flatten()
        for cbatch in range(0, X_train.shape[0], batch_size):
            yield (X_train[cbatch:(cbatch + batch_size), :, :],
                   y_train[cbatch:(cbatch + batch_size)])

model = Sequential()
# ... add layers to the model here before compiling ...
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

train_files = [train_bundle_loc + "bundle_" + str(cb) for cb in range(nb_train_bundles)]
gen = generate_batches(files=train_files, batch_size=batch_size)
history = model.fit_generator(gen, samples_per_epoch=samples_per_epoch,
                              nb_epoch=num_epoch, verbose=1, class_weight=class_weights)
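One practical detail with this setup is sizing samples_per_epoch: in Keras 1.x it should equal the total number of samples the generator yields in one pass over all files, so that each reported epoch corresponds to one full sweep of the bundles. A rough sketch for computing it, assuming every bundle pickles an (X, y) pair as consumed by generate_batches above (count_samples is a hypothetical helper, not part of the original answer):

import pickle

# Sum the number of rows across all training bundles so that one
# fit_generator epoch corresponds to one pass over every file.
def count_samples(files):
    total = 0
    for fname in files:
        X, y = pickle.load(open(fname, "rb"))
        total += X.shape[0]
    return total

samples_per_epoch = count_samples(train_files)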
Source: https://stackoverflow.com/questions/38805375/keras-batch-training-for-multiple-large-datasets