UPDATE: This question was for TensorFlow 1.x. I upgraded to 2.0 and (at least on the simple code below) the reproducibility issue seems fixed in 2.0. So that solves my problem.
You have a few options for stabilizing performance...
1) Set the seed for your initializers so the weights always start from the same values (see the seed-setting sketch after this list).
2) More data generally results in more stable convergence.
3) Lower learning rates and bigger batch sizes are also good for more predictable learning (see the compile/fit sketch below).
4) Train for a fixed number of epochs instead of using callbacks to modify hyperparameters during training (the same compile/fit sketch below keeps the epoch count fixed).
5) K-fold cross-validation to train on different subsets. The average over these folds should give a fairly predictable metric (see the k-fold sketch below).
6) You also have the option of just training multiple times and averaging the results (see the last sketch below).
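
For option 1), here is a minimal sketch of seeding the random sources in TensorFlow 2.x; the seed value and the layer shape are arbitrary placeholders:

```python
import random

import numpy as np
import tensorflow as tf

# Seed every source of randomness so weight initialization (and data
# shuffling) is repeatable across runs. 42 is an arbitrary choice.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# You can also seed individual initializers for finer control:
init = tf.keras.initializers.GlorotUniform(seed=SEED)
layer = tf.keras.layers.Dense(64, activation="relu", kernel_initializer=init)
```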
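For options 3) and 4), a sketch of a lower learning rate, a larger batch size, and a fixed epoch count with no hyperparameter-modifying callbacks. The model, the dummy data, and all the specific numbers are placeholders for your own setup:

```python
import numpy as np
import tensorflow as tf

# Dummy stand-in data; replace with your own dataset.
x_train = np.random.rand(1000, 10).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# A smaller learning rate and a larger batch size reduce the variance of
# each gradient step, which tends to make runs more repeatable.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")

# Train for a fixed number of epochs, without callbacks such as
# ReduceLROnPlateau or EarlyStopping that change behavior mid-run.
model.fit(x_train, y_train, batch_size=256, epochs=20)
```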
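For option 5), a rough sketch of k-fold validation using scikit-learn's KFold; again, the model and the data are stand-ins for your own:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

# Dummy stand-in data; replace with your own dataset.
X = np.random.rand(500, 10).astype("float32")
y = np.random.rand(500, 1).astype("float32")

def build_model():
    # Rebuild the model from scratch each fold so folds don't share weights.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=64, verbose=0)
    fold_scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))

# The average across folds is a more stable estimate than any single run.
print("mean validation loss:", np.mean(fold_scores))
```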
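For option 6), a sketch of training several times from fresh random weights and averaging the final metric; the run count, model, and data are again placeholders:

```python
import numpy as np
import tensorflow as tf

# Dummy stand-in data; replace with your own dataset.
X = np.random.rand(500, 10).astype("float32")
y = np.random.rand(500, 1).astype("float32")

final_losses = []
for run in range(5):
    # A fresh model (and therefore fresh random initial weights) every run.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(X, y, epochs=10, batch_size=64, verbose=0)
    final_losses.append(history.history["loss"][-1])

# Report the mean and spread of the final training loss across runs.
print(f"mean: {np.mean(final_losses):.4f}, std: {np.std(final_losses):.4f}")
```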