What I am doing
I am training and using a convolutional neural network (CNN) for image classification, using Keras with tensorflow-gpu as the backend.
During cross-validation, I wanted to run number_of_replicates folds (a.k.a. replicates) to get an average validation loss as a basis for comparison to another algorithm. So I needed to perform cross-validation for two separate algorithms, and since I have multiple GPUs available I figured this would not be a problem.
Unfortunately, I started seeing layer names get things like _2, _3, etc. appended to them in my loss logs. I also noticed that if I ran through the replicates (a.k.a. folds) sequentially using a loop in a single script, I ran out of memory on the GPUs.
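For context, the failure mode appears with a loop shaped roughly like the following minimal sketch (the model, input shape, and fold count here are made-up illustrations, not my actual code):

import tensorflow as tf

# Naive approach: build a fresh model for every fold inside one process.
for fold in range(5):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    # From the second fold on, auto-generated layer names pick up suffixes
    # (conv2d_1, conv2d_2, ...) because Keras's name counters are global to
    # the process, and GPU memory from earlier folds was, in my experience,
    # never fully released.
    print([layer.name for layer in model.layers])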
The strategy that worked for me was to launch each cross-validation fold/replicate in its own subprocess and kill any subprocess that runs past a time limit. I have been running for hours on end now in tmux sessions on an Ubuntu Lambda machine, still sometimes seeing memory leaks, but the leaking processes are killed off by the timeout. The approach requires estimating how long each fold/replicate could take to complete; in the code below that estimate is timeEstimateRequiredPerReplicate (it is best to double the number of trips through the loop in case half of them get killed off):
from multiprocessing import Process

# establish target for process workers
def process_machine():
    # import TF inside the worker so each process gets a fresh TF/Keras state
    import gc
    import tensorflow as tf
    from tensorflow.keras.backend import clear_session
    from tensorflow.python.framework.ops import disable_eager_execution

    clear_session()
    disable_eager_execution()

    nEpochs = 999  # set lower if not using tf.keras.callbacks.EarlyStopping in callbacks
    callbacks = ...  # establish early stopping, logging, etc. if desired
    algorithm_model = ...  # define layers, output(s), etc.
    opt_algorithm = ...  # choose your optimizer
    loss_metric = ...  # choose your loss function(s) (in a list for multiple outputs)
    algorithm_model.compile(optimizer=opt_algorithm, loss=loss_metric)

    trainData = ...  # establish which data to train on (for this fold/replicate only)
    validateData = ...  # establish which data to validate on (same caveat as above)
    algorithm_model.fit(
        x=trainData,
        steps_per_epoch=len(trainData),
        validation_data=validateData,
        validation_steps=len(validateData),
        epochs=nEpochs,
        callbacks=callbacks,
    )

    del algorithm_model
    gc.collect()

# establish main loop to start each process
def main_loop():
    replicatesDesired = ...  # total number of folds/replicates wanted
    replicatesCompleted = ...  # folds already finished in earlier runs
    timeEstimateRequiredPerReplicate = ...  # seconds allowed per fold before it is killed
    for replicate in range(replicatesDesired - replicatesCompleted):
        print('\nStarting cross-validation replicate {} of {} desired:\n'.format(
            replicate + replicatesCompleted + 1, replicatesDesired))
        p = Process(target=process_machine)
        p.start()
        # join() always returns None, so wait up to the time estimate,
        # then kill any straggler and report its exit code
        p.join(timeEstimateRequiredPerReplicate)
        if p.is_alive():
            p.terminate()
            p.join()
        print('\n\nSubprocess exited with code {}.\n\n'.format(p.exitcode))

# enable running of this script from the command line
if __name__ == "__main__":
    main_loop()
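The reason this works, as I understand it: all of the TensorFlow/Keras state lives inside the worker process, so when that process exits (or is terminated on timeout), the operating system reclaims its GPU memory and the global layer-name counters die with it, and every fold starts from a clean slate.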