I'm using TensorFlow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods to move data into TensorFlow.
The main question is: why is the example with preloaded (constant) data, examples/how_tos/reading_data/fully_connected_preloaded.py, significantly slower than the other data loading examples when using a GPU?
I had the same problem: fully_connected_preloaded.py is unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded on the CPU, not the GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav (a sketch putting them together follows the list):

- capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case)
- num_threads=5 for tf.train.batch
- capacity=500 for tf.train.batch
- time.sleep(10) after tf.train.start_queue_runners
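For concreteness, here is a minimal sketch of the input pipeline with these tips applied. The variable names data_sets and batch_size follow fully_connected_preloaded.py; the queue calls are the standard TF 0.10-era API.

import time
import tensorflow as tf

# preloaded dataset, as in the original example
input_images = tf.constant(data_sets.train.images)
input_labels = tf.constant(data_sets.train.labels)

# capacity=55000 lets the producer queue hold the whole MNIST training set
image, label = tf.train.slice_input_producer(
    [input_images, input_labels], capacity=55000)

# more threads and a deeper queue for the batching stage
images, labels = tf.train.batch(
    [image, label], batch_size=batch_size, num_threads=5, capacity=500)

sess = tf.Session()
sess.run(tf.initialize_all_variables())  # pre-1.0 initializer
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
time.sleep(10)  # let the queue runners fill the queues before timing steps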
However, the average speed per batch stayed the same. I tried timeline visualization for profiling, and still got QueueDequeueManyV2 dominating.
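For reference, this is how such a timeline trace can be captured in this TF version (assuming sess and train_op from the training loop); the resulting JSON opens in chrome://tracing, where QueueDequeueManyV2 showed up as the dominant blocks.

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# dump one step's stats as a Chrome trace
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())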
The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, leaving CPU-to-GPU data transfer as the bottleneck:
with tf.device('/cpu:0'):
input_images = tf.constant(data_sets.train.images)
input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation:

with tf.device('/gpu:0'):
Then I got a ~100x speed-up per batch.
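For clarity, the corrected block is identical to the original except for the device string:

with tf.device('/gpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)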
Note: in fully_connected_preloaded.py, the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.
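One way to check which ops actually end up on the CPU versus the GPU (and hence what that comment covers in practice) is to enable device placement logging, a standard tf.ConfigProto option:

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# the assigned device of every op is printed when the graph is set up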