Question
Problem type: regression
Inputs: sequence length varies from 14 to 39, each sequence point is a 4-element vector.
Output: a scalar
Neural Network: 3-layer Bi-LSTM (hidden vector size: 200) followed by 2 Fully Connected layers
Batch Size: 30
Number of samples per epoch: ~7,000
TensorFlow version: tf-nightly-gpu 1.6.0-dev20180112
CUDA version: 9.0
CuDNN version: 7
Details of the two GPUs:
GPU 0: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 totalMemory: 11.00GiB freeMemory: 10.72GiB
Device placement log: device_placement_log_0.txt
nvidia-smi during the run (using 1080 Ti only):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69 Driver Version: 385.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:02:00.0 Off | N/A |
| 20% 37C P2 58W / 250W | 10750MiB / 11264MiB | 10% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K1200 WDDM | 00000000:03:00.0 On | N/A |
| 39% 35C P8 1W / 31W | 751MiB / 4096MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
GPU 1: name: Quadro K1200 major: 5 minor: 0 memoryClockRate(GHz): 1.0325 totalMemory: 4.00GiB freeMemory: 3.44GiB
Device placement log: device_placement_log_1.txt
nvidia-smi during the run (using K1200 only):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69 Driver Version: 385.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:02:00.0 Off | N/A |
| 20% 29C P8 8W / 250W | 136MiB / 11264MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K1200 WDDM | 00000000:03:00.0 On | N/A |
| 39% 42C P0 6W / 31W | 3689MiB / 4096MiB | 23% Default |
+-------------------------------+----------------------+----------------------+
Time spent for one epoch:
GPU 0 only (environment variable CUDA_VISIBLE_DEVICES=0): ~60 minutes
GPU 1 only (environment variable CUDA_VISIBLE_DEVICES=1): ~45 minutes
The environment variable TF_MIN_GPU_MULTIPROCESSOR_COUNT=4 was set during both tests (a Python sketch of this setup follows below).
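For reproducibility, this is roughly how the device selection was done for each run. It is a minimal Python sketch based only on the two environment variables named above; setting them in the shell before launching the script works just as well:

import os

# Select which GPU TensorFlow may use (0 = GTX 1080 Ti, 1 = Quadro K1200).
# Must be set before TensorFlow initializes CUDA, i.e. before importing it.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# The K1200 has only 4 streaming multiprocessors, below TensorFlow's default
# minimum of 8, so lower the threshold for both tests to keep them comparable.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "4"

import tensorflow as tf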
Why is the better GPU (GeForce GTX 1080 Ti) slower at training my neural network?
Thanks in advance.
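For context, here is a minimal TF 1.x sketch of the model described above: variable-length sequences of 4-element vectors fed to a 3-layer Bi-LSTM (hidden size 200) followed by two fully connected layers producing a scalar. The placeholder names, the FC width of 100, the use of the top layer's final states as the sequence summary, and the Adam optimizer are assumptions for illustration, not taken from the original code:

import tensorflow as tf

HIDDEN = 200

# Batch of padded sequences: [batch, max_time, 4], plus the true lengths.
inputs = tf.placeholder(tf.float32, [None, None, 4], name="inputs")
seq_len = tf.placeholder(tf.int32, [None], name="seq_len")
targets = tf.placeholder(tf.float32, [None], name="targets")

layer_in = inputs
for i in range(3):  # three stacked Bi-LSTM layers
    with tf.variable_scope("bilstm_%d" % i):
        cell_fw = tf.nn.rnn_cell.LSTMCell(HIDDEN)
        cell_bw = tf.nn.rnn_cell.LSTMCell(HIDDEN)
        (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
            cell_fw, cell_bw, layer_in, sequence_length=seq_len, dtype=tf.float32)
        layer_in = tf.concat([out_fw, out_bw], axis=-1)

# Summarize each sequence with the final forward/backward hidden states
# of the top layer (one of several reasonable choices).
summary = tf.concat([state_fw.h, state_bw.h], axis=1)  # [batch, 2 * HIDDEN]

# Two fully connected layers down to a scalar prediction.
fc1 = tf.layers.dense(summary, 100, activation=tf.nn.relu)
pred = tf.squeeze(tf.layers.dense(fc1, 1), axis=1)

loss = tf.losses.mean_squared_error(labels=targets, predictions=pred)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)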
Update
Another set of tests on the MNIST dataset, using a CNN model, showed the same pattern:
Time spent for training 17 epochs:
GPU 0 (1080 Ti): ~59 minutes
GPU 1 (K1200): ~45 minutes
Answer 1:
The official TensorFlow documentation has a section "Allowing GPU memory growth" that introduces two session options for controlling GPU memory allocation. I tried them separately while training my RNN model (using only the GeForce GTX 1080 Ti):
config.gpu_options.allow_growth = True
and config.gpu_options.per_process_gpu_memory_fraction = 0.05
Each of them shortened the training time from the original ~60 minutes per epoch to ~42 minutes per epoch. I still don't understand why this helps; if you can explain it, I will accept that as the answer. Thanks.
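For reference, this is roughly how those two options are passed to a session (a minimal sketch; the fraction 0.05 is simply the value quoted above and would normally be tuned to the model's actual memory needs):

import tensorflow as tf

config = tf.ConfigProto()

# Option 1: start with a small allocation and let it grow as needed,
# instead of reserving nearly all free GPU memory up front.
config.gpu_options.allow_growth = True

# Option 2 (alternative): cap the allocation at a fixed fraction of GPU memory.
# config.gpu_options.per_process_gpu_memory_fraction = 0.05

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...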
Source: https://stackoverflow.com/questions/48236274/why-is-geforce-gtx-1080-ti-slower-than-quadro-k1200-on-training-a-rnn-model