What I am doing
I am training and using a convolutional neural network (CNN) for image classification, using Keras with tensorflow-gpu as the backend.
During cross-validation, I wanted to run number_of_replicates folds (a.k.a. replicates) to get an average validation loss as a basis for comparison to another algorithm. So I needed to perform cross-validation for two separate algorithms, and since I have multiple GPUs available I figured this would not be a problem.
Unfortunately, I started seeing layer names get things like _2, _3, etc. appended to them in my loss logs. I also noticed that if I ran through the replicates (a.k.a. folds) sequentially using a loop in a single script, I ran out of memory on the GPUs.
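For context, the failure mode appears with a loop shaped roughly like the following minimal sketch (the model, input shape, and fold count here are made-up illustrations, not my actual code):

import tensorflow as tf

# Naive approach: build a fresh model for every fold inside one process.
for fold in range(5):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    # From the second fold on, auto-generated layer names pick up suffixes
    # (conv2d_1, conv2d_2, ...) because Keras's name counters are global to
    # the process, and GPU memory from earlier folds was, in my experience,
    # never fully released.
    print([layer.name for layer in model.layers])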
The strategy that worked for me was to launch each cross-validation fold/replicate in its own subprocess and kill any subprocess that runs past a time limit. I have been running for hours on end now in tmux sessions on an Ubuntu Lambda machine, still sometimes seeing memory leaks, but the leaking processes are killed off by the timeout. The approach requires estimating how long each fold/replicate could take to complete; in the code below that estimate is timeEstimateRequiredPerReplicate (it is best to double the number of trips through the loop in case half of them get killed off):
from multiprocessing import Process

# establish target for process workers
def process_machine():
    # import TF inside the worker so each process gets a fresh TF/Keras state
    import gc
    import tensorflow as tf
    from tensorflow.keras.backend import clear_session
    from tensorflow.python.framework.ops import disable_eager_execution

    clear_session()
    disable_eager_execution()

    nEpochs = 999  # set lower if not using tf.keras.callbacks.EarlyStopping in callbacks
    callbacks = ...  # establish early stopping, logging, etc. if desired
    algorithm_model = ...  # define layers, output(s), etc.
    opt_algorithm = ...  # choose your optimizer
    loss_metric = ...  # choose your loss function(s) (in a list for multiple outputs)
    algorithm_model.compile(optimizer=opt_algorithm, loss=loss_metric)

    trainData = ...  # establish which data to train on (for this fold/replicate only)
    validateData = ...  # establish which data to validate on (same caveat as above)
    algorithm_model.fit(
        x=trainData,
        steps_per_epoch=len(trainData),
        validation_data=validateData,
        validation_steps=len(validateData),
        epochs=nEpochs,
        callbacks=callbacks,
    )

    del algorithm_model
    gc.collect()

# establish main loop to start each process
def main_loop():
    replicatesDesired = ...  # total number of folds/replicates wanted
    replicatesCompleted = ...  # folds already finished in earlier runs
    timeEstimateRequiredPerReplicate = ...  # seconds allowed per fold before it is killed
    for replicate in range(replicatesDesired - replicatesCompleted):
        print('\nStarting cross-validation replicate {} of {} desired:\n'.format(
            replicate + replicatesCompleted + 1, replicatesDesired))
        p = Process(target=process_machine)
        p.start()
        # join() always returns None, so wait up to the time estimate,
        # then kill any straggler and report its exit code
        p.join(timeEstimateRequiredPerReplicate)
        if p.is_alive():
            p.terminate()
            p.join()
        print('\n\nSubprocess exited with code {}.\n\n'.format(p.exitcode))

# enable running of this script from the command line
if __name__ == "__main__":
    main_loop()
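The reason this works, as I understand it: all of the TensorFlow/Keras state lives inside the worker process, so when that process exits (or is terminated on timeout), the operating system reclaims its GPU memory and the global layer-name counters die with it, and every fold starts from a clean slate.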