My machine has the following spec:
CPU: Xeon E5-1620 v4
GPU: Titan X (Pascal)
Ubuntu 16.04
Nvidia driver 375.26
CUDA tookit 8.0
If you use Keras, use CuDNNLSTM in place of LSTM or CuDNNGRU in place of GRU. In my case (2 Tesla M60), I am seeing 10x boost of performance. By the way I am using batch size 128 as suggested by @Alexey Golyshev.