nvidia-smi Volatile GPU-Utilization explanation?


Question


I know that nvidia-smi -l 1 will report the GPU usage every second (similar to the following output). However, I would appreciate an explanation of what Volatile GPU-Util really means. Is it the number of used SMs over total SMs, the occupancy, or something else?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |
| 30%   41C    P0    53W / 225W |      0MiB /  4742MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
| 36%   49C    P0    95W / 225W |   4516MiB /  4742MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      5193    C   python                                        4514MiB |
+-----------------------------------------------------------------------------+

Answer 1:


It is a sampled measurement over a time period. For a given time period, it reports what percentage of time one or more GPU kernel(s) was active (i.e. running).

It doesn't tell you anything about how many SMs were used, or how "busy" the code was, or what it was doing exactly, or in what way it may have been using memory.

The above claim(s) can be verified without too much difficulty using a microbenchmarking-type exercise (see below).

I don't know exactly how the sample period is defined, but since it is overall just a sampled measurement (i.e. nvidia-smi reports one sampled value each time you poll it), I don't think that should matter much for general usage or understanding of the tool. The sample period is obviously short, and it is not necessarily related to the polling interval, if one is specified for nvidia-smi. The sample period could probably also be uncovered using microbenchmarking techniques.
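For what it's worth, the same figure can also be read programmatically through NVML, the library nvidia-smi itself is built on; if I recall correctly, its documentation only pins the sample period down loosely (somewhere on the order of 1/6 s to 1 s, depending on the product). Here is a minimal polling sketch, assuming the NVML header and driver library are installed; the file name, device index, and loop length are my own choices, and error checking is omitted for brevity:

#include <stdio.h>
#include <unistd.h>
#include <nvml.h>   // NVML header, ships with the CUDA toolkit

// build (roughly): gcc poll_util.c -o poll_util -lnvidia-ml
int main(void){
  nvmlInit();                                    // initialize NVML
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(0, &dev);           // GPU 0
  for (int i = 0; i < 10; i++){                  // poll ten times, one second apart
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(dev, &util);   // the same counters nvidia-smi reports
    printf("GPU-Util: %u%%   Memory-Util: %u%%\n", util.gpu, util.memory);
    sleep(1);
  }
  nvmlShutdown();
  return 0;
}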

Also, the word "Volatile" does not pertain to this data item in nvidia-smi; it belongs to the "Volatile Uncorr. ECC" column header in the row above. The utilization column is labelled simply "GPU-Util". You are misreading the output format.

Here's a trivial code example that supports my claims:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

const long long tdelay = 1000000LL;  // clock cycles each kernel spins for
const int loops = 10000;             // number of kernel launches
const int hdelay = 1;                // default host sleep between launches, in microseconds

// a kernel that does nothing but busy-wait for tdelay clock cycles
__global__ void dkern(){

  long long start = clock64();
  while(clock64() < start+tdelay);
}

int main(int argc, char *argv[]){

  int my_delay = hdelay;
  if (argc > 1) my_delay = atoi(argv[1]);  // host sleep (us) taken from the command line
  for (int i = 0; i<loops; i++){
    dkern<<<1,1>>>();                      // launch a single-thread kernel...
    usleep(my_delay);}                     // ...then idle the GPU for my_delay microseconds

  return 0;
}
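To reproduce the numbers below, something like the following should do (the file name t1.cu is my own choice; the command line argument is the host sleep between kernel launches, in microseconds):

nvcc -o t1 t1.cu
./t1 1000          # in one terminal
nvidia-smi -l 1    # in another terminal, watch the GPU-Util column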

On my system, when I run the above code with a command line parameter of 100, nvidia-smi will report 99% utilization. When I run with a command line parameter of 1000, nvidia-smi will report ~83% utilization. When I run it with a command line parameter of 10000, nvidia-smi will report ~9% utilization.




Answer 2:


nvidia-smi -l 1 will continuously print to the console, which can be useful if you need to log usage to a file.
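For example, a quick sketch for logging just the utilization numbers using nvidia-smi's CSV query mode (the file name gpu_util.log is my own choice):

nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 > gpu_util.log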

watch -n 1 nvidia-smi is a less confusing option for interactive monitoring: it runs nvidia-smi every second and refreshes the display in place instead of scrolling.



Source: https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation
