Question
I have implemented a network that tries to predict a word from a sentence. The network is actually pretty complex, but here’s a simple version of it:
- Take the indices of the words in each sentence and convert them to embeddings
- Run each sentence through a bidirectional LSTM
- Give each word in the sentence a score via a linear transformation (matrix multiplication) of the LSTM output
And here’s the code:
# 40 samples of random length up to 500; vocabulary size is 10000 with 50 dimensions
import tensorflow as tf
from tensorflow.contrib.rnn import BasicLSTMCell  # TF 1.0 location

def inference(inputs):
    # inputs: [batch_size, max_length] int32 word indices, zero-padded
    inputs = tf.constant(inputs)
    word_embeddings = tf.nn.embedding_lookup(embeddings, inputs)
    lengths = sequence_length(inputs)  # true (unpadded) length of each sentence
    cell = BasicLSTMCell(cell_size)
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(
        cell, cell, word_embeddings, sequence_length=lengths, dtype=tf.float32)
    lstm_out = tf.concat(outputs, 2)  # [batch, max_length, 2 * cell_size]
    words_representations = tf.reshape(lstm_out, [-1, cell_size * 2])
    W = tf.Variable(tf.random_uniform([cell_size * 2, 1], -1, 1), name='W', dtype=tf.float32)
    words_scores = tf.reshape(tf.matmul(words_representations, W), [-1, max_length])
    return words_scores
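For context, here is a minimal driver for the function above. It is only a sketch under stated assumptions: the gist defines sequence_length, embeddings, cell_size, and max_length, so the stand-ins below assume zero-padding, the sizes from the comment above, and an arbitrary cell_size of 200.

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10000, 50   # from the comment above
cell_size, max_length = 200, 500        # cell_size is a guess; max_length from the comment
embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1.0, 1.0))

def sequence_length(inputs):
    # Assumed helper: number of non-zero (non-padding) indices per sentence.
    return tf.reduce_sum(tf.cast(tf.not_equal(inputs, 0), tf.int32), axis=1)

# 40 zero-padded sentences of random length up to max_length.
data = np.zeros((40, max_length), dtype=np.int32)
for i, n in enumerate(np.random.randint(1, max_length, size=40)):
    data[i, :n] = np.random.randint(1, vocab_size, size=n)

scores = inference(data)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(scores).shape)   # (40, max_length)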
Reproducible code is on a gist.
My problem is low GPU utilization, around 40%, on a Titan X.
A few notes:
- I am using a batch size of 40. I could make it much bigger, say 1000, and get a much better per-sample speed (the time per batch stays almost the same: ~0.7s vs ~0.8s, so the per-sample time would drop from ~17.5ms to ~0.8ms), but I want the batch size to stay at 40 for various reasons, among them staying close to the article I'm trying to implement.
- In this example the given batch is also the entire dataset, so there are no batching or queueing issues involved, which should make the problem easier to analyze.
What I suspect to be the problem, as also noted in answers to similar questions on the forums, is repeated CPU-to-GPU memory transfers. This is how the tracing result looks:
The green bars (pid 9 and 11) are all MEMCPYHtoD or MEMCPYDtoH. How can I avoid them?
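For reference, a trace like this can be produced in TF 1.0 with a full-trace session run plus the timeline module; a minimal sketch, where sess and words_scores stand in for whatever the gist actually runs:

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
# 'sess' and 'words_scores' are assumptions, not taken from the gist.
sess.run(words_scores, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace JSON; open it at chrome://tracing to see the
# MEMCPYHtoD / MEMCPYDtoH bars per pid.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())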
What can I do to improve performance without making the batch size bigger?
The sample reproducible code can be found here. I am using TensorFlow 1.0, cuDNN 5.1, and CUDA 8.
Thanks.
Source: https://stackoverflow.com/questions/42319786/low-gpu-usage-performance-with-tensorflow-rnns