Why multiply the error by the derivative of the sigmoid in neural networks?

Truffle

Look at this image. If the sigmoid function gives you a HIGH or LOW value (pretty good confidence), the derivative at that value is LOW. If you get a value at the steepest part of the slope (0.5), the derivative at that value is HIGH.

When the function gives us a bad prediction, we want to change our weights by a larger amount; conversely, if the prediction is good (high confidence), we do NOT want to change our weights much.
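To see this numerically, here is a minimal sketch (not part of the original answer), using the fact that the derivative of the sigmoid, written in terms of its output s, is s * (1 - s):

    def sigmoid_deriv_from_output(s):
        # derivative of the sigmoid, expressed through its output s = sigmoid(x)
        return s * (1 - s)

    for s in [0.001, 0.1, 0.5, 0.9, 0.999]:
        print(f"output={s:.3f}  derivative={sigmoid_deriv_from_output(s):.6f}")

    # output near 0 or 1 (confident) -> derivative near 0 -> tiny weight change
    # output near 0.5 (unsure)       -> derivative 0.25   -> largest weight change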

First of all, this line is correct:

l1_delta = l1_error * nonlin(l1, True)

The total error from the next layer, l1_error, is multiplied by the derivative of the current layer (here I treat the sigmoid as a separate layer to simplify the backpropagation flow). This is the chain rule.
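For context, here is a minimal sketch of the kind of single-layer toy network that line usually comes from; the nonlin helper and the training data are assumptions modeled on the well-known tiny numpy tutorial, not code taken from the question:

    import numpy as np

    def nonlin(x, deriv=False):
        if deriv:
            return x * (1 - x)       # sigmoid derivative, written in terms of its output
        return 1 / (1 + np.exp(-x))  # sigmoid

    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])  # inputs
    y = np.array([[0, 0, 1, 1]]).T                              # targets
    np.random.seed(1)
    syn0 = 2 * np.random.random((3, 1)) - 1                     # weights

    for _ in range(10000):
        l0 = X
        l1 = nonlin(np.dot(l0, syn0))           # forward pass through the sigmoid
        l1_error = y - l1                       # error coming back from the output
        l1_delta = l1_error * nonlin(l1, True)  # chain rule: error scaled by the sigmoid derivative
        syn0 += np.dot(l0.T, l1_delta)          # weight update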

The quote about "network confidence" may indeed be confusing for a novice learner. What they mean here is the probabilistic interpretation of the sigmoid function. Sigmoid (or, more generally, softmax) is very often the last layer in classification problems: sigmoid outputs a value in [0, 1], which can be read as a probability or confidence of class 0 or class 1.

In this interpretation, sigmoid = 0.001 is high confidence of class 0, which corresponds to a small gradient and a small update to the network; sigmoid = 0.999 is high confidence of class 1 (again a small gradient); and sigmoid = 0.499 is low confidence in either class, which gives the largest gradient and the largest update.
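As a quick numeric check of that reading (a sketch with assumed numbers; the target here is class 1):

    def delta(target, output):
        error = target - output
        return error * output * (1 - output)  # error scaled by the sigmoid derivative

    print(delta(1.0, 0.999))  # confident and correct: ~1e-06, almost no update
    print(delta(1.0, 0.499))  # unsure: ~0.125, a sizeable update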

Note that in your example, sigmoid is the last layer, so you can look at this network as doing binary classification, hence the interpretation above makes sense.

If you consider a sigmoid activation in the hidden layers, the confidence interpretation is more questionable (though one can ask how confident a particular neuron is). But the error propagation formula still holds, because the chain rule holds.

A follow-up question was: "Surely it would just be better to cube the l1_error if you wanted to emphasise the error?"

Here's an important note. The big success of neural networks over the last several years is, at least partially, due to the use of ReLU instead of sigmoid in the hidden layers, exactly because it's better not to saturate the gradient; saturation leads to the well-known vanishing gradient problem. So, on the contrary, you generally don't want to emphasise the error in backprop.
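As a rough illustration of the saturation point (a sketch, not part of the original answer): the sigmoid derivative never exceeds 0.25 and collapses toward zero for large inputs, while the ReLU derivative stays at 1 for any positive input:

    import math

    def sigmoid_deriv(x):
        s = 1 / (1 + math.exp(-x))
        return s * (1 - s)            # peaks at 0.25 (x = 0), vanishes for large |x|

    def relu_deriv(x):
        return 1.0 if x > 0 else 0.0  # does not shrink, no matter how large x gets

    for x in [-6.0, -2.0, 0.0, 2.0, 6.0]:
        print(f"x={x:+.1f}  sigmoid'={sigmoid_deriv(x):.4f}  relu'={relu_deriv(x):.1f}")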
