What is the intuition behind using tanh in LSTM?

执笔经年 2020-12-22 15:49

In the LSTM network (Understanding LSTMs), why do the input gate and output gate use tanh? What is the intuition behind this? Is it just a nonlinear transformation? If so, can I replace it with another activation function (e.g. ReLU)?

2 Answers
  •  悲哀的现实
    2020-12-22 16:04

    Sigmoid specifically is used as the gating function for the three gates (input, output, forget) in the LSTM because it outputs a value between 0 and 1, so a gate can either block the flow of information completely or let it pass through in full. To overcome the vanishing gradient problem, on the other hand, we need a function whose second derivative can sustain over a long range before going to zero, and tanh is a good function with that property.
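    To make that split concrete, here is a minimal single-step LSTM cell in NumPy. This is my own sketch (the function name lstm_step and the weight dictionaries W, U, b are purely illustrative, not any library's API): the three sigmoid gates scale information into (0, 1), while tanh bounds the candidate and the hidden state in (-1, 1).

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM time step. W, U, b are dicts keyed by 'f', 'i', 'o', 'g'
        (forget gate, input gate, output gate, candidate); the names are
        illustrative, not tied to any particular library."""
        f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate, in (0, 1)
        i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate, in (0, 1)
        o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate, in (0, 1)
        g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate, in (-1, 1)
        c = f * c_prev + i * g       # gates decide how much to keep / write
        h = o * np.tanh(c)           # tanh keeps the hidden state bounded
        return h, c

    # toy usage: 3-dimensional input, 4-dimensional hidden state
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(4, 3)) for k in 'fiog'}
    U = {k: rng.normal(size=(4, 4)) for k in 'fiog'}
    b = {k: np.zeros(4) for k in 'fiog'}
    h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
    ```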

    A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization) and easy to compute. If you consider these qualities, then I believe you can use ReLU in place of the tanh function, since the two are very good alternatives to each other. But before choosing an activation function, you should know the advantages and disadvantages of your choice over the others. I will briefly describe some of the activation functions and their advantages; a short code sketch follows each one.
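    If you do want to experiment with that swap, one possible sketch (the names cell_update and act are my own, purely illustrative) is to make the candidate/state nonlinearity a parameter and compare tanh with ReLU on the same gate values; note that the ReLU version loses the boundedness mentioned above.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        return np.maximum(z, 0.0)

    def cell_update(f, i, o, g_pre, c_prev, act=np.tanh):
        """Gate activations f, i, o are already in (0, 1); g_pre is the
        pre-activation candidate; `act` is the nonlinearity under test."""
        c = f * c_prev + i * act(g_pre)
        h = o * act(c)   # with ReLU, h is no longer bounded in (-1, 1)
        return h, c

    f, i, o = sigmoid(np.array([0.5])), sigmoid(np.array([0.2])), sigmoid(np.array([1.0]))
    g_pre, c_prev = np.array([2.0]), np.array([3.0])
    print(cell_update(f, i, o, g_pre, c_prev, act=np.tanh))
    print(cell_update(f, i, o, g_pre, c_prev, act=relu))
    ```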

    Sigmoid

    Mathematical expression: sigmoid(z) = 1 / (1 + exp(-z))

    1st order derivative: sigmoid'(z) = exp(-z) / (1 + exp(-z))^2 = sigmoid(z) * (1 - sigmoid(z))

    Advantages:

    (1) Sigmoid function has all the fundamental properties of a good activation function.
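    A quick NumPy sanity check of the formula and its derivative (my own sketch, not part of the original answer):

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        # exp(-z) / (1 + exp(-z))^2, equivalently sigmoid(z) * (1 - sigmoid(z))
        s = sigmoid(z)
        return s * (1.0 - s)

    z = np.linspace(-10, 10, 5)
    print(sigmoid(z))       # values squashed into (0, 1)
    print(sigmoid_grad(z))  # gradient vanishes for large |z|
    ```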
    

    Tanh

    Mathematical expression: tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]

    1st order derivative: tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)

    Advantages:

    (1) Often found to converge faster in practice
    (2) Gradient computation is less expensive
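    A corresponding sketch for tanh; note how the derivative reuses the forward value, which is why the gradient is cheap to compute:

    ```python
    import numpy as np

    def tanh_grad(z):
        # 1 - tanh(z)^2: reuses the forward value, so the gradient is cheap
        t = np.tanh(z)
        return 1.0 - t * t

    z = np.linspace(-3, 3, 7)
    print(np.tanh(z))    # output in (-1, 1), zero-centred
    print(tanh_grad(z))  # largest gradient at z = 0
    ```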
    

    Hard Tanh

    Mathematical expression: hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1

    1st order derivative: hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise

    Advantages:

    (1) Computationally cheaper than Tanh
    (2) Saturates for magnitudes of z greater than 1
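    A hard tanh sketch built on np.clip; the gradient mask makes the saturation regions explicit:

    ```python
    import numpy as np

    def hardtanh(z):
        # piecewise-linear clip to [-1, 1]
        return np.clip(z, -1.0, 1.0)

    def hardtanh_grad(z):
        # 1 inside [-1, 1], 0 where the function saturates
        return ((z >= -1.0) & (z <= 1.0)).astype(float)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(hardtanh(z))       # values clipped to [-1, 1]
    print(hardtanh_grad(z))  # zero gradient outside [-1, 1]
    ```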
    

    ReLU

    Mathematical expression: relu(z) = max(z, 0)

    1st order derivative: relu'(z) = 1 if z > 0; 0 otherwise

    Advantages:

    (1) Does not saturate even for large values of z
    (2) Found much success in computer vision applications
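    A ReLU sketch; the output shows that large positive inputs pass through unchanged (no saturation) while non-positive inputs get exactly zero gradient:

    ```python
    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu_grad(z):
        # 1 for positive inputs, 0 otherwise (the value at z = 0 is a convention)
        return (z > 0).astype(float)

    z = np.array([-5.0, -0.1, 0.0, 0.1, 100.0])
    print(relu(z))       # no upper saturation, even for large z
    print(relu_grad(z))  # gradient is exactly 0 for non-positive inputs
    ```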
    

    Leaky ReLU

    Mathematical expression: leaky(z) = max(z, k*z), where 0 < k < 1

    1st order derivative: leaky'(z) = 1 if z > 0; k otherwise

    Advantages:

    (1) Allows propagation of error for non-positive z, which ReLU doesn't
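    A leaky ReLU sketch with a slope of k = 0.01 (the value of k here is my own choice for illustration):

    ```python
    import numpy as np

    def leaky_relu(z, k=0.01):
        # k is the small slope applied to non-positive inputs, 0 < k < 1
        return np.where(z > 0, z, k * z)

    def leaky_relu_grad(z, k=0.01):
        # unlike ReLU, the gradient never vanishes for z <= 0
        return np.where(z > 0, 1.0, k)

    z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(leaky_relu(z))       # negative inputs are scaled by k, not zeroed
    print(leaky_relu_grad(z))  # gradient is k, not 0, for non-positive inputs
    ```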
    

    This paper explains some other activation functions; you may consider reading it.
