How does the epsilon hyperparameter affect tf.train.AdamOptimizer?


Question


When I set epsilon=10e-8, AdamOptimizer doesn't work. When I set it to 1, it works just fine.


Answer 1:


t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

where g is the gradient of the loss with respect to the variable.
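Translated into plain NumPy, one update step looks like this (a minimal sketch mirroring the pseudocode above; adam_step is a hypothetical helper, with the tf.train.AdamOptimizer default hyperparameters):

import numpy as np

def adam_step(variable, g, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update for a single tensor of parameters.
    t = t + 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t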

The epsilon avoids a division-by-zero error in the update above when the gradient is almost zero, so ideally epsilon should be a small value. But a small epsilon in the denominator makes the weight updates larger: Adam effectively normalizes each update by sqrt(v_t), so with a tiny epsilon the normalized step m_t / (sqrt(v_t) + epsilon) stays close to magnitude 1 even when the gradient itself is vanishingly small.

So, I would guess that when you train with a very small epsilon, the optimizer becomes unstable.

The trade-off is that the bigger you make epsilon (and hence the denominator), the smaller the weight updates are, and the slower training progresses. Most of the time you want the denominator to be able to get small. Usually, an epsilon value greater than 10e-4 performs better.
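The effect is easy to check numerically. With moment estimates left behind by near-zero gradients, the normalized step m_t / (sqrt(v_t) + epsilon) stays close to 1 for a tiny epsilon but shrinks toward the gradient's own scale for a large one (a sketch with illustrative numbers, not values from the question):

import numpy as np

m, v = 1e-6, 1e-12   # moment estimates after many gradients of magnitude ~1e-6
for epsilon in (1e-8, 1e-4, 1.0):
    step = m / (np.sqrt(v) + epsilon)  # normalized step, before the lr_t scaling
    print(epsilon, step)
# 1e-8 -> ~0.99   (near full-size step despite a ~1e-6 gradient)
# 1e-4 -> ~0.0099
# 1.0  -> ~1e-6   (step shrinks along with the gradient)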

As the tf.train.AdamOptimizer documentation itself notes: "The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1."
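In TensorFlow 1.x, epsilon is passed directly to the optimizer's constructor, so trying the larger values suggested above is a one-line change (a sketch; the toy loss below just stands in for whatever scalar you are minimizing):

import tensorflow as tf

x = tf.Variable(5.0)
loss = tf.square(x)  # toy scalar loss for illustration

# The default is epsilon=1e-8; the docs suggest 1.0 or 0.1 for Inception on ImageNet.
optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=0.1)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)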



Source: https://stackoverflow.com/questions/43221065/how-does-the-epsilon-hyperparameter-affect-tf-train-adamoptimizer
