What do I need K.clear_session() and del model for (Keras with Tensorflow-gpu)?

情书的邮戳 2020-12-01 01:39

What I am doing
I am training and using a convolutional neural network (CNN) for image classification using Keras with Tensorflow-gpu as backend.

3 Answers
  •  孤街浪徒
    2020-12-01 02:06

    During cross-validation, I wanted to run number_of_replicates folds (a.k.a. replicates) to get an average validation loss as a basis for comparison to another algorithm. So I needed to perform cross-validation for two separate algorithms, and I have multiple GPUs available, so I figured this would not be a problem.

    Unfortunately, I started seeing layer names getting things like _2, _3, etc. appended to them in my loss logs. I also noticed that when I ran through the replicates (a.k.a. folds) sequentially using a loop in a single script, I ran out of memory on the GPUs.

    This strategy worked for me; I have been running for hours on end now in tmux sessions on an Ubuntu Lambda machine. Occasionally a replicate leaks memory or hangs, but the timeout kills it off. The strategy requires estimating how long each cross-validation fold/replicate could take to complete; in the code below that number is timeEstimateRequiredPerReplicate (best to double the number of trips through the loop in case half of them get killed off):

    from multiprocessing import Process
    
    # establish target for process workers
    def machine():
        import tensorflow as tf
        from tensorflow.keras.backend import clear_session
    
        from tensorflow.python.framework.ops import disable_eager_execution
        import gc
    
        clear_session()
    
        disable_eager_execution()  
        nEpochs = 999 # set lower if not using tf.keras.callbacks.EarlyStopping in callbacks
        callbacks = ... # establish early stopping, logging, etc. if desired
    
        algorithm_model = ... # define layers, output(s), etc.
        opt_algorithm = ... # choose your optimizer
        loss_metric = ... # choose your loss function(s) (in a list for multiple outputs)
        algorithm_model.compile(optimizer=opt_algorithm, loss=loss_metric)
    
        trainData = ... # establish which data to train on (for this fold/replicate only)
        validateData = ... # establish which data to validate on (same caveat as above)
        algorithm_model.fit(
            x=trainData,
            steps_per_epoch=len(trainData),
            validation_data=validateData,
            validation_steps=len(validateData),
            epochs=nEpochs,
            callbacks=callbacks
        )
    
        del algorithm_model  # drop the last reference to the model first...
        gc.collect()         # ...then force garbage collection
    
        return
    
    
    # establish main loop to start each process
    def main_loop():
        for replicate in range(replicatesDesired - replicatesCompleted):
            print(
                '\nStarting cross-validation replicate {} '.format(
                    replicate +
                    replicatesCompleted + 1
                ) +
                'of {} desired:\n'.format(
                    replicatesDesired
                )
            )
            p = Process(target=machine)  # run this replicate in its own process
            p.start()
            p.join(timeEstimateRequiredPerReplicate)  # join() returns None, not an exit code
            if p.is_alive():   # replicate overran its time estimate; kill it
                p.terminate()
                p.join()
            print('\n\nSubprocess exited with code {}.\n\n'.format(p.exitcode))
        return
    
    
    # enable running of this script from command line
    if __name__ == "__main__":
        main_loop()
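    The timeout-and-kill mechanics above can be distilled to a stdlib-only sketch. The names run_fold and run_with_timeout below are hypothetical, and run_fold just stands in for one training fold. One gotcha worth stressing: Process.join() always returns None, so the exit status has to be read from Process.exitcode after joining.

```python
from multiprocessing import Process
import time


def run_fold():
    # stand-in for one cross-validation fold; real training code goes here
    time.sleep(30)


def run_with_timeout(target, timeout_seconds):
    p = Process(target=target)
    p.start()
    p.join(timeout_seconds)   # blocks up to the timeout; always returns None
    if p.is_alive():          # the fold overran its time budget
        p.terminate()         # send SIGTERM to the child
        p.join()              # reap the killed process
    return p.exitcode         # 0 = clean exit; negative = killed by that signal


if __name__ == "__main__":
    code = run_with_timeout(run_fold, timeout_seconds=0.5)
    print('Subprocess exited with code {}.'.format(code))
```

    On Unix, a terminated child reports exitcode -15 (the negation of SIGTERM), which makes it easy to tell from the logs whether a replicate finished cleanly or was killed by the timeout.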
    
    
