TensorFlow: Confusion regarding the Adam optimizer

栀梦 2020-12-24 08:26

I'm confused as to how the Adam optimizer actually works in TensorFlow.

The way I read the docs, it says that the learning rate is changed every gradient descent iteration.

2 Answers
攒了一身酷 2020-12-24 08:51

    RMSProp and Adam both have adaptive learning rates.

    The basic RMSProp update:

    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    x += - learning_rate * dx / (np.sqrt(cache) + eps)
    

    You can see that originally this has two hyperparameters: decay_rate and eps.
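
    To make this concrete, here is a minimal, self-contained NumPy sketch of that update on a toy quadratic loss (the loss, the hyperparameter values and the step count are illustrative assumptions, not TensorFlow's implementation):

    import numpy as np

    # Toy problem: minimize f(x) = x**2, whose gradient is dx = 2*x.
    learning_rate, decay_rate, eps = 0.01, 0.9, 1e-8

    x, cache = 5.0, 0.0
    for step in range(1000):
        dx = 2 * x                                           # gradient of the toy loss
        cache = decay_rate * cache + (1 - decay_rate) * dx**2
        x += -learning_rate * dx / (np.sqrt(cache) + eps)    # per-parameter scaled step

    print(x)  # ends near the minimum at 0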

    Then we can add momentum to make the gradient more stable, and the update becomes:

    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    m = beta1*m + (1-beta1)*dx                # beta1 = the momentum parameter in the docs
    x += - learning_rate * m / (np.sqrt(cache) + eps)
    

    Now you can see that if we set beta1 = 0, this reduces to plain RMSProp without momentum.
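
    A tiny check of that claim (the previous momentum and gradient values are arbitrary, just for illustration):

    beta1 = 0.0
    m = 0.7          # whatever the previous momentum was; beta1 = 0 multiplies it away
    dx = 3.7         # some arbitrary gradient value
    m = beta1 * m + (1 - beta1) * dx
    print(m == dx)   # True: with beta1 = 0, m is just the raw gradient dx,
                     # so the update line above falls back to plain RMSProp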

    Then, the basics of Adam.

    In CS231n, Andrej Karpathy initially describes Adam like this:

    Adam is a recently proposed update that looks a bit like RMSProp with momentum

    So yes! Then what makes it different from RMSProp with momentum?

    m = beta1*m + (1-beta1)*dx
    v = beta2*v + (1-beta2)*(dx**2)
    x += - learning_rate * m / (np.sqrt(v) + eps)
    

    He also points out that in this update equation m and v are smoother estimates of the gradient and the squared gradient.

    So the difference from RMSProp is that the update is less noisy.
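
    The m and v above correspond to the first- and second-moment state that TensorFlow's Adam keeps per variable, and the knobs in the equations map onto the constructor arguments of tf.keras.optimizers.Adam (the values shown are the Keras defaults; the toy variable and loss below are only for illustration):

    import tensorflow as tf

    # beta_1, beta_2 and epsilon play the roles of beta1, beta2 and eps above.
    opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                   beta_2=0.999, epsilon=1e-7)

    # One gradient step on a single variable w for the toy loss w**2.
    w = tf.Variable(5.0)
    with tf.GradientTape() as tape:
        loss = w ** 2
    grads = tape.gradient(loss, [w])
    opt.apply_gradients(zip(grads, [w]))
    print(w.numpy())  # slightly below 5.0 after one Adam step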

    What causes this noise?

    Well, in the initialization procedure we initialize m and v to zero:

    m=v=0
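
    To see why that matters, look at the very first update with a hypothetical gradient value (chosen only for illustration):

    beta1, beta2 = 0.9, 0.999
    dx = 4.0                               # hypothetical first gradient
    m = beta1 * 0.0 + (1 - beta1) * dx
    v = beta2 * 0.0 + (1 - beta2) * dx**2
    print(m)   # ≈ 0.4   -> only 10% of the actual gradient
    print(v)   # ≈ 0.016 -> only 0.1% of the actual squared gradient (16.0)

    Both estimates start out biased towards zero, and by different amounts, so the first few steps are erratic.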

    In order to reduce this initialization effect, it's always good to have some warm-up (bias correction). The equations then become:

    m = beta1*m + (1-beta1)*dx               # beta1 = 0.9, beta2 = 0.999 (typical values)
    mt = m / (1-beta1**t)                    # bias-corrected first moment
    v = beta2*v + (1-beta2)*(dx**2)
    vt = v / (1-beta2**t)                    # bias-corrected second moment
    x += - learning_rate * mt / (np.sqrt(vt) + eps)
    

    Now run this for a few iterations and pay attention to the bias-correction lines (mt and vt): as t (the iteration number) increases, the following happens to mt:

    mt = m

    because beta1 < 1, so beta1**t goes to 0 and the correction factor (1 - beta1**t) goes to 1. In other words, the bias correction only matters during the first few iterations (the warm-up); after that mt is essentially just m, and likewise vt is essentially v.
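
    Putting the whole thing together as one runnable NumPy sketch (the toy quadratic loss and the larger-than-default learning rate are assumptions chosen only so the example converges quickly; this illustrates the update above, not TensorFlow's internal code). It also prints the mt correction factor so you can watch the warm-up fade out:

    import numpy as np

    learning_rate, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, 2001):                           # t starts at 1
        dx = 2 * x                                     # gradient of the toy loss f(x) = x**2
        m = beta1 * m + (1 - beta1) * dx
        v = beta2 * v + (1 - beta2) * (dx ** 2)
        mt = m / (1 - beta1 ** t)                      # bias-corrected first moment
        vt = v / (1 - beta2 ** t)                      # bias-corrected second moment
        x += -learning_rate * mt / (np.sqrt(vt) + eps)
        if t in (1, 10, 100, 1000):
            print(t, 1 - beta1 ** t)                   # ≈ 0.10, 0.65, 1.00, 1.00

    print(x)  # ends near the minimum at 0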
