Question
I have implemented a network that tries to predict a word from a sentence. The network is actually pretty complex, but here’s a simple version of it:
- Take the indices of the words in each sentence and convert them to embeddings
- Run each sentence through a bidirectional LSTM
- Give each word in the sentence a score via a linear transformation (matrix multiplication) of the LSTM output
And here’s the code:
# 40 samples of random length up to 500; vocabulary size is 10000 with 50 dimensions
import tensorflow as tf
from tensorflow.contrib.rnn import BasicLSTMCell  # TF 1.0 location

def inference(inputs):
    # inputs: [batch_size, max_length] int32 word indices, zero-padded
    inputs = tf.constant(inputs)
    word_embeddings = tf.nn.embedding_lookup(embeddings, inputs)
    lengths = sequence_length(inputs)  # true (unpadded) length of each sentence
    cell = BasicLSTMCell(cell_size)
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(
        cell, cell, word_embeddings, sequence_length=lengths, dtype=tf.float32)
    lstm_out = tf.concat(outputs, 2)  # [batch, max_length, 2 * cell_size]
    words_representations = tf.reshape(lstm_out, [-1, cell_size * 2])
    W = tf.Variable(tf.random_uniform([cell_size * 2, 1], -1, 1), name='W', dtype=tf.float32)
    words_scores = tf.reshape(tf.matmul(words_representations, W), [-1, max_length])
    return words_scores
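For context, here is a minimal driver for the function above. It is only a sketch under stated assumptions: the gist defines sequence_length, embeddings, cell_size, and max_length, so the stand-ins below assume zero-padding, the sizes from the comment above, and an arbitrary cell_size of 200.

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10000, 50   # from the comment above
cell_size, max_length = 200, 500        # cell_size is a guess; max_length from the comment
embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1.0, 1.0))

def sequence_length(inputs):
    # Assumed helper: number of non-zero (non-padding) indices per sentence.
    return tf.reduce_sum(tf.cast(tf.not_equal(inputs, 0), tf.int32), axis=1)

# 40 zero-padded sentences of random length up to max_length.
data = np.zeros((40, max_length), dtype=np.int32)
for i, n in enumerate(np.random.randint(1, max_length, size=40)):
    data[i, :n] = np.random.randint(1, vocab_size, size=n)

scores = inference(data)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(scores).shape)   # (40, max_length)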
Reproducible code is on a gist.
My problem is low GPU utilization, around 40%, on a Titan X.
A few notes:
- I am using a batch size of 40. I could make it much bigger, say 1000, and get a much better per-sample speed (the time per batch stays almost the same: ~0.7s vs ~0.8s, so the per-sample time would drop from ~17.5ms to ~0.8ms), but I want the batch size to stay at 40 for various reasons, among them staying close to the article I'm trying to implement.
- In this example the given batch is also the entire dataset, so there are no batching or queueing issues involved, which should make the problem easier to analyze.
What I suspect to be the problem, as also noted in answers to similar questions on the forums, is repeated CPU-to-GPU memory transfers. This is how the tracing result looks:
The green bars (pid 9 and 11) are all MEMCPYHtoD or MEMCPYDtoH. How can I avoid them?
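For reference, a trace like this can be produced in TF 1.0 with a full-trace session run plus the timeline module; a minimal sketch, where sess and words_scores stand in for whatever the gist actually runs:

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
# 'sess' and 'words_scores' are assumptions, not taken from the gist.
sess.run(words_scores, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace JSON; open it at chrome://tracing to see the
# MEMCPYHtoD / MEMCPYDtoH bars per pid.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())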
What can I do to improve performance without making the batch size bigger?
The sample reproducible code can be found here. I am using TensorFlow 1.0, cuDNN 5.1, and CUDA 8.
Thanks.
Source: https://stackoverflow.com/questions/42319786/low-gpu-usage-performance-with-tensorflow-rnns