Tensorflow: How do you monitor GPU performance during model training in real-time?

Submitted by 馋奶兔 on 2020-05-12 12:22:25

Question


I am new to Ubuntu and GPUs and have recently been using a new PC in our lab with Ubuntu 16.04, four NVIDIA 1080 Ti GPUs, and a 16-core i7 processor.

I have some basic questions:

  1. TensorFlow is installed with GPU support. I presume, then, that it automatically prioritises GPU usage? If so, does it use all four together, or does it use one and then recruit another if needed?

  2. Can I monitor the GPU usage/activity in real time during training of a model?

I fully understand this is basic hardware stuff, but clear, definitive answers to these specific questions would be great.

EDIT:

Based on this output - is this really saying that nearly all the memory on each one of my GPUs is being used?


Answer 1:


  1. TensorFlow does not automatically utilize all GPUs; by default it uses only one, specifically the first GPU, /gpu:0.

    You have to write multi-GPU code to utilize all available GPUs. See the CIFAR-10 multi-GPU example.

  2. To check usage every 0.1 seconds:

    watch -n0.1 nvidia-smi
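A minimal sketch of what such multi-GPU code looks like with `tf.device` directives (TensorFlow 1.x graph style, loosely following the CIFAR-10 multi-GPU tutorial's tower pattern; the computation inside each tower here is a made-up placeholder, not a real training step):

```python
# Sketch: pinning one "tower" of the computation to each GPU with
# tf.device (TensorFlow 1.x graph style). The per-tower computation
# below is a placeholder, not an actual model.
NUM_GPUS = 4
DEVICE_NAMES = ['/gpu:%d' % i for i in range(NUM_GPUS)]

def build_towers():
    import tensorflow as tf  # requires a GPU-enabled TensorFlow install
    tower_losses = []
    for dev in DEVICE_NAMES:
        with tf.device(dev):  # ops created in this block run on this GPU
            x = tf.random_normal([128, 10])  # stand-in for a batch of data
            tower_losses.append(tf.reduce_mean(tf.square(x)))
    # Average the per-GPU losses on the default device
    return tf.add_n(tower_losses) / NUM_GPUS

print(DEVICE_NAMES)
```

In the real CIFAR-10 example each tower also computes gradients, which are then averaged on the CPU before the update; the sketch above only shows the device-placement mechanism.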




Answer 2:


  1. If no other indication is given, a GPU-enabled TensorFlow installation will default to the first available GPU (as long as you have the Nvidia driver and CUDA 8.0 installed and the GPU has the necessary compute capability, which, according to the docs, is 3.0). If you want to use more GPUs, you need to use tf.device directives in your graph (more about it here).
  2. The easiest way to check the GPU usage is the console tool nvidia-smi. However, unlike top or other similar programs, it only shows the current usage once and then exits. As suggested in the comments, you can use something like watch -n1 nvidia-smi to re-run the program continuously (in this case every second).
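If you want the numbers from nvidia-smi inside a script rather than a terminal, its machine-readable query flags (`--query-gpu`, `--format=csv`) make polling straightforward. A hedged sketch (it assumes the `nvidia-smi` binary is on your PATH; the parsing demo at the end runs on a hard-coded sample line rather than a live machine):

```python
# Sketch: polling nvidia-smi programmatically instead of using `watch`.
# Assumes `nvidia-smi` is on PATH; the query/format flags are standard
# nvidia-smi options.
import subprocess

def parse_gpu_csv(csv_text):
    """Parse `nvidia-smi --query-gpu=index,utilization.gpu,memory.used
    --format=csv,noheader,nounits` output into a list of
    (gpu index, utilization %, memory used MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(',')]
        rows.append((int(index), int(util), int(mem)))
    return rows

def query_gpus():
    """Run nvidia-smi once and return parsed per-GPU stats."""
    out = subprocess.check_output(
        ['nvidia-smi',
         '--query-gpu=index,utilization.gpu,memory.used',
         '--format=csv,noheader,nounits'],
        text=True)
    return parse_gpu_csv(out)

# Demo on a sample of what nvidia-smi might emit on a 4-GPU box
# (hypothetical numbers):
sample = "0, 97, 10873\n1, 12, 405\n2, 0, 2\n3, 0, 2\n"
print(parse_gpu_csv(sample))
```

Calling `query_gpus()` in a loop with a `time.sleep()` gives you a simple in-process monitor you can log alongside training metrics.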



Answer 3:


All the commands above use watch; it's much more efficient to keep the context alive by using nvidia-smi's built-in loop option: nvidia-smi -l 1.

If you want to see something like htop and nvidia-smi at the same time, you can try glances (pip install glances).




Answer 4:


If you are using GCP, take a look at this script, which lets you monitor GPU utilization in Stackdriver. It collects nvidia-smi data (via nvidia-smi -l 5) and reports those statistics for you to track.

https://github.com/GoogleCloudPlatform/ml-on-gcp/tree/master/dlvm/gcp-gpu-utilization-metrics



Source: https://stackoverflow.com/questions/45544603/tensorflow-how-do-you-monitor-gpu-performance-during-model-training-in-real-tim
