I'm testing out my new NVIDIA Titan V, which supports float16 operations. I noticed that during training, float16 is much slower (~800 ms/step) than float32 (~500 ms/step).
I updated to CUDA 10.0, cuDNN 7.4.1, TensorFlow 1.13.1, Keras 2.2.4, and Python 3.7.3. With the same code as in the OP, float16 training was marginally faster than float32.
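For anyone reproducing this: I don't have the OP's exact code in front of me, but the usual way to flip the dtype in Keras 2.2.4 is through the backend, roughly like this (the model below is just a stand-in, not the OP's actual network; bumping epsilon is a commonly suggested precaution for float16, not something from this thread):

```python
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense

# Switch all layer weights/activations to float16
# (use 'float32' for the baseline timing run).
K.set_floatx('float16')
# Default epsilon (1e-7) underflows in float16; raise it.
K.set_epsilon(1e-4)

# Stand-in model purely for illustration.
model = Sequential([
    Dense(256, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
```

Timing is then just a matter of running `model.fit(...)` once per dtype and comparing the reported ms/step.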
I fully expect that a more complex network architecture would show a bigger difference in performance, but I didn't test this.