Question
I've been trying the TensorFlow tutorial scripts on Google Cloud ML. In particular, I've used the CIFAR-10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.
When I run this training script in Google Cloud ML, there is a memory leak of around 0.5% per hour.
I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.
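For reference, a minimal sketch of that packaging step, assuming the file names from the tutorial repo (per the linked guide, the scripts go into a Python package, here named trainer to match the gcloud command below):

# Put the tutorial scripts into a "trainer" package so that
# --module-name=trainer.cifar10_multi_gpu_train resolves.
mkdir -p trainer
touch trainer/__init__.py
cp cifar10.py cifar10_input.py cifar10_multi_gpu_train.py trainer/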
If I run locally (i.e. not in Google Cloud) and use TCMalloc, by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option on Google Cloud ML.
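The local invocation with TCMalloc preloaded looks something like this (a sketch: the library path varies by distribution, e.g. on Ubuntu the gperftools packages often install it as /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4, and --num_gpus=1 is just an example value):

# Preload TCMalloc so it replaces the default malloc for this process only.
LD_PRELOAD="/usr/lib/libtcmalloc.so" python cifar10_multi_gpu_train.py --num_gpus=1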
What could be causing the leak, and what can I do to fix it? Why aren't other users noticing the same problem? Although the leak is small, it is big enough to cause my training sessions to run out of memory and fail when I run against my own data for several days. The leak happens regardless of the number of GPUs I use.
The gcloud command I used is:
gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.0 \
  -- \
  --num_gpus=4
The config file (config.yml) is:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
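  # complex_model_m_gpu provides four GPUs, which matches --num_gpus=4 above.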
Any help appreciated, thanks.
Answer 1:
We recommend using this version of the code:
github.com/tensorflow/models/pull/1538
which has performance benefits: by running for less time, you're less prone to OOMs.
That, of course, may not be a permanent fix; however, according to our testing, TensorFlow 1.2 appears to address the issue. TensorFlow 1.2 will be available soon on Cloud ML Engine. If you continue to have problems, please let us know.
Source: https://stackoverflow.com/questions/44412820/memory-leak-in-tensorflow-google-cloud-ml-training