Low GPU Usage & Performance with Tensorflow + RNNs

Submitted by 孤者浪人 on 2019-12-21 03:01:25

Question


I have implemented a network that tries to predict a word from a sentence. The network is actually pretty complex, but here’s a simple version of it:

  1. Take the indices of the words in a sentence and convert them to embeddings
  2. Run each sentence through an LSTM
  3. Score each word in the sentence with a linear transformation of the LSTM output

And here’s the code:

# 40 samples with random length up to 500; vocabulary size is 10000 with 50 dimensions.
# embeddings, cell_size, max_length and the sequence_length() helper are defined in the gist.
import tensorflow as tf
from tensorflow.contrib.rnn import BasicLSTMCell

def inference(inputs):
    inputs = tf.constant(inputs)
    word_embeddings = tf.nn.embedding_lookup(embeddings, inputs)

    # Actual (unpadded) length of each sentence, so the RNN can stop early.
    lengths = sequence_length(inputs)

    cell = BasicLSTMCell(cell_size)
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(
        cell, cell, word_embeddings, sequence_length=lengths, dtype=tf.float32)
    # Concatenate forward and backward outputs: [batch, time, 2 * cell_size].
    lstm_out = tf.concat(outputs, 2)

    # Score every word with a single linear layer over its BiLSTM output.
    words_representations = tf.reshape(lstm_out, [-1, cell_size * 2])
    W = tf.Variable(tf.random_uniform([cell_size * 2, 1], -1, 1),
                    name='W', dtype=tf.float32)
    words_scores = tf.reshape(tf.matmul(words_representations, W), [-1, max_length])

    return words_scores

Reproducible code on gist
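For completeness, here is a minimal sketch of a driver in the spirit of the gist, with random zero-padded data matching the comment above; the sequence_length idiom, the cell_size value and the other sizes are my own stand-ins, not necessarily the gist's exact code:

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10000, 50
num_samples, max_length, cell_size = 40, 500, 100

embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1, 1))

def sequence_length(t):
    # Common idiom: count the non-zero (i.e. non-padding) indices per row.
    return tf.reduce_sum(tf.sign(t), 1)

# Random sentences of random length, zero-padded to max_length.
lengths = np.random.randint(1, max_length + 1, size=num_samples)
inputs = np.zeros((num_samples, max_length), dtype=np.int32)
for i, n in enumerate(lengths):
    inputs[i, :n] = np.random.randint(1, vocab_size, size=n)

words_scores = inference(inputs)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(words_scores)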

My problem is low GPU utilization, around 40%, on a Titan X.

A few notes:

  • I am using a batch size of 40. While I could make it much bigger, say 1000, and get very good per-sample speed (the time per batch stays almost the same: ~0.7s vs ~0.8s), I want to keep the batch size at 40 for various reasons, among them staying comparable to a paper I am trying to implement.
  • In this example the given batch is also the entire dataset, so there are no batching or queuing issues involved, which should make the problem easier to analyze.

What I suspect to be the problem, as noted in answers to similar questions on the forums, is repeated CPU-to-GPU memory transfers. This is how the tracing result looks:

The green bars (pid 9 and 11) are all MEMCPYHtoD or MEMCPYDtoH. How can I avoid them?
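For reference, a trace like this can be produced with TensorFlow's timeline tooling; a minimal sketch, assuming a session sess and the words_scores tensor from the snippet above:

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(words_scores, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open it at chrome://tracing to see the MEMCPY bars.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())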

What can I do to improve performance without increasing the batch size?
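One diagnostic that may be relevant here: device placement can be logged to see which ops end up on the CPU, since any op that falls back to the CPU forces host-device copies around it. A minimal sketch:

# Log where each op is placed; ops that land on /cpu:0 force
# host <-> device copies around them.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(words_scores)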

The reproducible sample code can be found here. I am using TensorFlow 1.0, cuDNN 5.1 and CUDA 8.

Thanks.

Source: https://stackoverflow.com/questions/42319786/low-gpu-usage-performance-with-tensorflow-rnns
