SageMaker fails when using Multi-GPU with keras.utils.multi_gpu_model

Posted by 雨燕双飞 on 2019-12-04 22:43:15

This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First I initialize using:

def setup_multi_gpus():
    """
    Set up multi-GPU usage.

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    # In newer Keras releases this is exposed as keras.utils.multi_gpu_model
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: tell TensorFlow not to pre-allocate all of the GPU memory
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    sess = tf.Session(config=config)
    set_session(sess)  # set this TensorFlow session as the default session for Keras

    # count the GPUs TensorFlow can see
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Number of GPUs available: %s' % num_gpu)

    return num_gpu

Then I call:

# Setup multi GPU usage
num_gpu = setup_multi_gpus()

and create a model.

...

After which you can turn it into a multi-GPU model:

multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...
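For reference, here is a minimal end-to-end sketch of that workflow. The tiny Sequential model, the random data, and the compile/fit arguments are placeholders for illustration only, not your actual training code, and it assumes the Keras 2.x API from the snippet above:

# Sketch only: toy model and random data, assuming Keras 2.x with the TF 1.x backend
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

num_gpu = setup_multi_gpus()

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(20,)))
model.add(Dense(1, activation='sigmoid'))

# multi_gpu_model requires at least two GPUs; fall back to the plain model otherwise
if num_gpu >= 2:
    multi_model = multi_gpu_model(model, gpus=num_gpu)
else:
    multi_model = model

multi_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# random placeholder data, just to show the call pattern
x = np.random.random((256, 20))
y = np.random.randint(2, size=(256, 1))
multi_model.fit(x, y, epochs=2, batch_size=64)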

The only thing here that is different from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine that being the problem, but it might be worth trying out.

Good luck!

Edit: I noticed that sequence-to-sequence models do not work with multi-GPU training. Is that the type of model you are trying to train?

I apologize for the slow response.

There are several threads running in parallel on this issue, and I want to link them together so that others who hit the same problem can follow the progress and discussion:

https://forums.aws.amazon.com/thread.jspa?messageID=881541

https://forums.aws.amazon.com/thread.jspa?messageID=881540

https://github.com/aws/sagemaker-python-sdk/issues/512

I have a few questions regarding this.

What versions of TensorFlow and Keras are you using?

I am not too sure what is causing this problem. Does your container have all of the needed dependencies, such as CUDA, etc.? https://www.tensorflow.org/install/gpu

Were you able to train with Keras on a single GPU?
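For the version and dependency questions above, a quick sanity check run inside the training container could look like the sketch below. It assumes the TF 1.x plus standalone Keras setup used in the answer above; adjust the imports if your versions differ:

# Sketch: print versions and check GPU visibility (GPUs only appear if CUDA/cuDNN are set up)
import tensorflow as tf
import keras
from tensorflow.python.client import device_lib

print('TensorFlow version:', tf.__version__)
print('Keras version:', keras.__version__)

devices = device_lib.list_local_devices()
print('GPUs visible to TensorFlow:', [d.name for d in devices if d.device_type == 'GPU'])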
