TensorFlow: Confusion regarding the Adam optimizer


栀梦

I'm confused as to how the Adam optimizer actually works in TensorFlow.

The way I read the docs, it says that the learning rate is changed every gradient

2 Answers

攒了一身酷

2020-12-24 08:51
RMSProp and Adam both have adaptive learning rates.

The basic RMSProp update:
```
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```
As you can see, besides the learning rate this originally has two parameters: decay_rate and eps.
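To make the update concrete, here is a minimal sketch that applies it to a single scalar parameter. The toy objective f(x) = x**2, the starting point, the hyperparameter values and the helper name `rmsprop_step` are all assumptions for illustration, not part of the answer above; it uses `math.sqrt` rather than `np.sqrt` so it runs without NumPy.

```python
import math

def rmsprop_step(x, dx, cache, learning_rate=0.01, decay_rate=0.9, eps=1e-8):
    """One basic RMSProp update for a scalar parameter x with gradient dx."""
    # Moving (discounted) average of the squared gradient
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    # Step scaled by the root of that average
    x = x - learning_rate * dx / (math.sqrt(cache) + eps)
    return x, cache

# Toy problem (assumed for illustration): minimize f(x) = x**2, whose gradient is 2*x
x, cache = 5.0, 0.0
for _ in range(2000):
    x, cache = rmsprop_step(x, 2 * x, cache)

print(x)  # ends up close to the minimum at 0
```

Because the gradient is divided by its own running magnitude, the step size stays roughly on the order of learning_rate regardless of how large the raw gradient is.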

Then we can add momentum to make the gradient more stable, and write:
```
cache = decay_rate * cache + (1 - decay_rate) * dx**2
m = beta1*m + (1-beta1)*dx  # beta1 is the momentum parameter in the docs
x += - learning_rate * m / (np.sqrt(cache) + eps)  # step along the smoothed gradient m
```
Now you can see that if we set beta1 = 0, this is RMSProp without momentum.

The basics of Adam

In CS231n, Andrej Karpathy initially describes Adam like this:

Adam is a recently proposed update that looks a bit like RMSProp with momentum

So yes! Then what makes it different from RMSProp with momentum?
```
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
```
He also mentions that the m and v used in the update equation are smoother.

So the difference from RMSProp is that the update is less noisy.

Where does this noise come from?

Well, in the initialization step we initialize m and v to zero.

m=v=0

To reduce the effect of this zero initialization, it always helps to have some warm-up (bias correction). The equations then look like this:
```
# typical values: beta1 = 0.9, beta2 = 0.999; t is the iteration number, starting at 1
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)
```
Now we run this for a few iterations. Pay close attention to the two bias-correction lines (the ones computing mt and vt): as the iteration number t increases, beta1**t shrinks toward 0, the denominator 1 - beta1**t approaches 1, and the following happens to mt:

mt = m
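The effect of the correction is easy to check numerically. The sketch below assumes a constant gradient dx = 1.0 (a made-up value chosen so the true average is known) and tracks both the raw moving average m and the corrected mt.

```python
beta1 = 0.9
dx = 1.0  # assumed constant gradient, for illustration only

m = 0.0
history = []  # (t, uncorrected m, bias-corrected mt)
for t in range(1, 51):
    m = beta1 * m + (1 - beta1) * dx
    mt = m / (1 - beta1 ** t)
    history.append((t, m, mt))

for t, m_raw, m_hat in history[:3] + history[-1:]:
    print(t, m_raw, m_hat)
```

With a constant gradient, m starts far below the true value because it was initialized at zero, while mt matches the gradient from the first step. As t grows the two converge, which is the "mt = m" limit above.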
上瘾入骨i

2020-12-24 09:09
I find the documentation quite clear. I will paste the algorithm here in pseudo-code:

Your parameters:
- learning_rate: between 1e-4 and 1e-2 is standard
- beta1: 0.9 by default
- beta2: 0.999 by default
- epsilon: 1e-08 by default
  
  The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

Initialization:
```
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
```
m_t and v_t keep track of a moving average of the gradient and of its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep 2M additional values in memory.)

At each iteration t, and for each parameter of the model:
```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```
Here lr_t is a bit different from learning_rate because for early iterations the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is large (well beyond 1./(1.-beta2)), lr_t is almost equal to learning_rate.
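A quick way to see this is to compute the ratio lr_t / learning_rate at a few timesteps. The sketch below uses the default beta values listed above. The values of m, v and t in the second half are arbitrary numbers chosen for illustration; they check that, with epsilon set to 0, folding the correction into lr_t gives the same step as dividing m_t and v_t by (1 - beta1^t) and (1 - beta2^t) separately.

```python
import math

learning_rate, beta1, beta2 = 0.001, 0.9, 0.999

def lr_t(t):
    """Effective learning rate at timestep t, as in the pseudo-code."""
    return learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)

# Ratio lr_t / learning_rate at a few timesteps
ratios = {t: lr_t(t) / learning_rate for t in (1, 10, 100, 1000, 10000)}
for t, r in ratios.items():
    print(t, r)

# Arbitrary illustrative moment values (not from any real training run)
m, v, t = 0.3, 0.04, 5
step_a = learning_rate * (m / (1 - beta1 ** t)) / math.sqrt(v / (1 - beta2 ** t))
step_b = lr_t(t) * m / math.sqrt(v)
print(step_a, step_b)
```

The ratio starts well below 1 and approaches 1 only once t is several multiples of 1/(1 - beta2).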

To answer your question, you just need to pass a fixed learning rate, keep beta1 and beta2 default values, maybe modify epsilon, and Adam will do the magic :)
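As a rough illustration of "pass a fixed learning rate and let Adam handle the rest", here is a plain-Python sketch of the pseudo-code for a single scalar variable. It is not TensorFlow's implementation. The function name `adam_minimize`, the toy objective (x - 3)**2 and the chosen learning rate are assumptions made for this example.

```python
import math

def adam_minimize(grad_fn, x0, steps, learning_rate=0.01,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Run the Adam pseudo-code on one scalar variable and return its final value."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # The learning rate passed in stays fixed; only lr_t changes with t
        lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        x -= lr_t * m / (math.sqrt(v) + epsilon)
    return x

# Toy objective (assumed): f(x) = (x - 3)**2, with gradient 2 * (x - 3)
x_final = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0, steps=2000)
print(x_final)  # should land near the minimum at 3
```

Note that learning_rate itself is never modified. The per-step adaptation comes entirely from m, v and the bias-correction factor.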

Link with RMSProp

Adam with beta1=0 is equivalent to RMSProp with momentum=0, apart from Adam's bias-correction factor on v_t (with beta1=0, m_t is just the current gradient). The argument beta2 of Adam and the argument decay of RMSProp are the same.

However, RMSProp does not keep a moving average of the gradient. But it can maintain a momentum, like MomentumOptimizer.
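The link can be checked numerically. The sketch below sets Adam's beta1 to 0, so that m_t is simply the current gradient, and compares the resulting updates with RMSProp at momentum 0. Epsilon is dropped in both, and the gradient sequence and function names are made up for illustration. Under these assumptions the two updates should differ only by Adam's bias-correction factor sqrt(1 - beta2^t).

```python
import math

def adam_updates(grads, learning_rate=0.001, beta1=0.0, beta2=0.9):
    """Adam updates with epsilon = 0, using the lr_t form of the pseudo-code."""
    m, v, out = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        out.append(lr_t * m / math.sqrt(v))
    return out

def rmsprop_updates(grads, learning_rate=0.001, decay=0.9):
    """RMSProp updates with momentum = 0 and epsilon = 0."""
    v, out = 0.0, []
    for g in grads:
        v = decay * v + (1 - decay) * g ** 2
        out.append(learning_rate * g / math.sqrt(v))
    return out

grads = [0.5, -1.2, 0.8, 2.0]  # made-up gradient sequence
adam = adam_updates(grads)
rms = rmsprop_updates(grads)

# Ratio of the two updates at each step, next to sqrt(1 - beta2**t)
ratios = [a / r for a, r in zip(adam, rms)]
for t, ratio in enumerate(ratios, start=1):
    print(t, ratio, math.sqrt(1 - 0.9 ** t))
```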

A detailed description of RMSProp:
- maintain a moving (discounted) average of the square of gradients
- divide gradient by the root of this average
- (can maintain a momentum)

Here is the pseudo-code:
```
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
```
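The pseudo-code translates almost line for line into plain Python. The sketch below is for a single scalar variable. The function name `rmsprop_momentum_step`, the default hyperparameter values and the toy gradient 2*x are assumptions for illustration, not TensorFlow's defaults or API.

```python
import math

def rmsprop_momentum_step(variable, gradient, v, mom, learning_rate=0.01,
                          decay=0.9, momentum=0.9, epsilon=1e-10):
    """One RMSProp-with-momentum update, following the pseudo-code above."""
    # Moving average of the squared gradient
    v = decay * v + (1 - decay) * gradient ** 2
    # Momentum accumulates the normalized gradient step
    mom = momentum * mom + learning_rate * gradient / math.sqrt(v + epsilon)
    variable = variable - mom
    return variable, v, mom

# Toy usage (assumed objective f(x) = x**2, gradient 2*x)
x, v, mom = 5.0, 0.0, 0.0
for _ in range(10):
    x, v, mom = rmsprop_momentum_step(x, 2 * x, v, mom)
print(x, v, mom)
```

Setting momentum=0 reduces this to the plain RMSProp step, with the one difference that epsilon sits inside the square root here.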