Keras model training memory leak

Submitted by 夙愿已清 on 2019-12-11 10:19:43

Question


I'm new to Keras, TensorFlow, and Python, and I'm trying to build a model for personal use/future learning. I've just started with Python and came up with this code (with the help of videos and tutorials). My problem is that Python's memory usage slowly creeps up with each epoch, and even after constructing a new model. Once memory usage reaches 100%, the training just stops with no error or warning. I don't know much yet, but the issue should be somewhere within the loop (if I'm not mistaken). I know about

K.clear_session()

but either it didn't remove the issue or I don't know how to integrate it into my code. I have Python 3.6.4, TensorFlow 2.0.0rc1 (CPU version), and Keras 2.3.0.

This is my code:

import pandas as pd
import os
import time
import tensorflow as tf
import numpy as np
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

EPOCHS = 25
BATCH_SIZE = 32           

df = pd.read_csv("EntryData.csv", names=['1SH5', '1SHA', '1SA5', '1SAA', '1WH5', '1WHA',
                                         '2SA5', '2SAA', '2SH5', '2SHA', '2WA5', '2WAA',
                                         '3R1', '3R2', '3R3', '3R4', '3R5', '3R6',
                                         'Target'])

df_val = 14554 

validation_df = df[df.index > df_val]
df = df[df.index <= df_val]

train_x = df.drop(columns=['Target'])
train_y = df[['Target']]
validation_x = validation_df.drop(columns=['Target'])
validation_y = validation_df[['Target']]

train_x = np.asarray(train_x)
train_y = np.asarray(train_y)
validation_x = np.asarray(validation_x)
validation_y = np.asarray(validation_y)

train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
validation_x = validation_x.reshape(validation_x.shape[0], 1, validation_x.shape[1])

dense_layers = [0, 1, 2]
layer_sizes = [32, 64, 128]
conv_layers = [1, 2, 3]

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            NAME = "{}-conv-{}-nodes-{}-dense-{}".format(conv_layer, layer_size, 
                    dense_layer, int(time.time()))
            tensorboard = TensorBoard(log_dir="logs\{}".format(NAME))
            print(NAME)

            model = Sequential()
            model.add(LSTM(layer_size, input_shape=(train_x.shape[1:]), 
                                       return_sequences=True))
            model.add(Dropout(0.2))
            model.add(BatchNormalization())

            for l in range(conv_layer-1):
                model.add(LSTM(layer_size, return_sequences=True))
                model.add(Dropout(0.1))
                model.add(BatchNormalization())

            for l in range(dense_layer):
                model.add(Dense(layer_size, activation='relu'))
                model.add(Dropout(0.2))

            model.add(Dense(2, activation='softmax'))

            opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)

            # Compile model
            model.compile(loss='sparse_categorical_crossentropy',
                          optimizer=opt,
                          metrics=['accuracy'])

            # unique file name that will include the epoch 
            # and the validation acc for that epoch
            filepath = "RNN_Final.{epoch:02d}-{val_accuracy:.3f}"  
            checkpoint = ModelCheckpoint("models\{}.model".format(filepath, 
                         monitor='val_acc', verbose=0, save_best_only=True, 
                         mode='max')) # saves only the best ones

            # Train model
            history = model.fit(
                train_x, train_y,
                batch_size=BATCH_SIZE,
                epochs=EPOCHS,
                validation_data=(validation_x, validation_y),
                callbacks=[tensorboard, checkpoint])

# Score model
score = model.evaluate(validation_x, validation_y, verbose=2)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# Save model
model.save("models\{}".format(NAME))

Also, I don't know if it's possible to ask two problems within one question (I don't want to spam this with problems that anyone with any Python experience could resolve within a minute), but I also have a problem with checkpoint saving. I want to save only the best-performing model (one model per NN specification, i.e. number of nodes/layers), but currently a model is saved after every epoch; a rough sketch of what I'm aiming for follows below. If this is inappropriate to ask here, I can create another question for it.
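This is roughly my understanding of how best-only saving is supposed to work (a sketch, not my current code; NAME is the per-specification name from the loop above):

from tensorflow.keras.callbacks import ModelCheckpoint

# one file per network specification, overwritten only when val_accuracy improves
checkpoint = ModelCheckpoint(
    "models/{}.model".format(NAME),  # fixed path per specification, not per epoch
    monitor='val_accuracy',          # metric name as reported by model.fit
    save_best_only=True,
    mode='max',
    verbose=0)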

Thank you very much for any help.


Answer 1:


One source of the problem: building a new model with model = Sequential() inside the loop does not remove the previous model. It remains built within its TensorFlow graph scope, and every new model = Sequential() adds another lingering construction that eventually exhausts memory. To ensure a model is properly destroyed in full, run the following once you're done with it:

import gc
import tensorflow as tf
from tensorflow.keras import backend as K

del model
gc.collect()                        # collect remnant references after del
K.clear_session()                   # the main call: clears the TensorFlow graph
tf.compat.v1.reset_default_graph()  # TF graph isn't the same as the Keras graph

gc is Python's garbage-collection module; gc.collect() clears the remnant traces of the model after del. K.clear_session() is the main call and clears the TensorFlow graph.
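For example, the cleanup could go at the end of each iteration of your nested loop, after fit/save (a minimal sketch; build_model is a hypothetical stand-in for the Sequential construction in your question):

import gc
import tensorflow as tf
from tensorflow.keras import backend as K

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            model = build_model(dense_layer, layer_size, conv_layer)  # hypothetical helper: your Sequential code
            model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS,
                      validation_data=(validation_x, validation_y))
            model.save("models/{}-{}-{}".format(conv_layer, layer_size, dense_layer))

            # release everything belonging to this run before the next model is built
            del model
            gc.collect()
            K.clear_session()
            tf.compat.v1.reset_default_graph()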

Also, while your idea for model checkpointing, logging, and hyperparameter search is quite sound, the execution is faulty; as written, you will actually be testing only one hyperparameter combination for the entire nested loop you've set up there. But this should be asked in a separate question.


UPDATE: I just encountered the same problem on a fully and properly set-up environment; the likeliest conclusion is that it's a bug, and a definite culprit is eager execution. To work around it, use

tf.compat.v1.disable_eager_execution() # right after `import tensorflow as tf`

to switch to graph mode, which can also run significantly faster. Also see the updated cleanup code above.
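For clarity, the switch has to run before any model or graph is created, so it belongs right at the top of the script (a minimal sketch of the intended placement):

import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # must run before any model is built

from tensorflow.keras.models import Sequential  # the remaining imports follow as before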



Source: https://stackoverflow.com/questions/58137677/keras-model-training-memory-leak
