Why multiply the error by the derivative of the sigmoid in neural networks?

Truffle

Look at this image. If the sigmoid function gives you a HIGH or LOW value (pretty good confidence), the derivative at that value is LOW. If you get a value at the steepest part of the slope (0.5), the derivative at that value is HIGH.

When the function gives us a bad prediction, we want to change our weights by a larger amount; conversely, if the prediction is good (high confidence), we do NOT want to change our weights much.
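To see this numerically, here is a minimal sketch (not part of the original answer), using the fact that the derivative of the sigmoid, written in terms of its output s, is s * (1 - s):

    def sigmoid_deriv_from_output(s):
        # derivative of the sigmoid, expressed through its output s = sigmoid(x)
        return s * (1 - s)

    for s in [0.001, 0.1, 0.5, 0.9, 0.999]:
        print(f"output={s:.3f}  derivative={sigmoid_deriv_from_output(s):.6f}")

    # output near 0 or 1 (confident) -> derivative near 0 -> tiny weight change
    # output near 0.5 (unsure)       -> derivative 0.25   -> largest weight change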

First of all, this line is correct:

l1_delta = l1_error * nonlin(l1, True)

The total error from the next layer, l1_error, is multiplied by the derivative of the current layer (here I treat the sigmoid as a separate layer to simplify the backpropagation flow). This is the chain rule.
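For context, here is a minimal sketch of the kind of single-layer toy network that line usually comes from; the nonlin helper and the training data are assumptions modeled on the well-known tiny numpy tutorial, not code taken from the question:

    import numpy as np

    def nonlin(x, deriv=False):
        if deriv:
            return x * (1 - x)       # sigmoid derivative, written in terms of its output
        return 1 / (1 + np.exp(-x))  # sigmoid

    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])  # inputs
    y = np.array([[0, 0, 1, 1]]).T                              # targets
    np.random.seed(1)
    syn0 = 2 * np.random.random((3, 1)) - 1                     # weights

    for _ in range(10000):
        l0 = X
        l1 = nonlin(np.dot(l0, syn0))           # forward pass through the sigmoid
        l1_error = y - l1                       # error coming back from the output
        l1_delta = l1_error * nonlin(l1, True)  # chain rule: error scaled by the sigmoid derivative
        syn0 += np.dot(l0.T, l1_delta)          # weight update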

The quote about "network confidence" may indeed be confusing for a novice learner. What they mean here is the probabilistic interpretation of the sigmoid function. Sigmoid (or, more generally, softmax) is very often the last layer in classification problems: sigmoid outputs a value in [0, 1], which can be read as a probability or confidence of class 0 or class 1.

In this interpretation, sigmoid = 0.001 is high confidence of class 0, which corresponds to a small gradient and a small update to the network; sigmoid = 0.999 is high confidence of class 1 (again a small gradient); and sigmoid = 0.499 is low confidence in either class, which gives the largest gradient and the largest update.
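As a quick numeric check of that reading (a sketch with assumed numbers; the target here is class 1):

    def delta(target, output):
        error = target - output
        return error * output * (1 - output)  # error scaled by the sigmoid derivative

    print(delta(1.0, 0.999))  # confident and correct: ~1e-06, almost no update
    print(delta(1.0, 0.499))  # unsure: ~0.125, a sizeable update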

Note that in your example, sigmoid is the last layer, so you can look at this network as doing binary classification, hence the interpretation above makes sense.

If you consider a sigmoid activation in the hidden layers, the confidence interpretation is more questionable (though one can ask how confident a particular neuron is). But the error propagation formula still holds, because the chain rule holds.

A follow-up question was: "Surely it would just be better to cube the l1_error if you wanted to emphasise the error?"

Here's an important note. The big success of neural networks over the last several years is, at least partially, due to the use of ReLU instead of sigmoid in the hidden layers, exactly because it's better not to saturate the gradient; saturation leads to the well-known vanishing gradient problem. So, on the contrary, you generally don't want to emphasise the error in backprop.
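As a rough illustration of the saturation point (a sketch, not part of the original answer): the sigmoid derivative never exceeds 0.25 and collapses toward zero for large inputs, while the ReLU derivative stays at 1 for any positive input:

    import math

    def sigmoid_deriv(x):
        s = 1 / (1 + math.exp(-x))
        return s * (1 - s)            # peaks at 0.25 (x = 0), vanishes for large |x|

    def relu_deriv(x):
        return 1.0 if x > 0 else 0.0  # does not shrink, no matter how large x gets

    for x in [-6.0, -2.0, 0.0, 2.0, 6.0]:
        print(f"x={x:+.1f}  sigmoid'={sigmoid_deriv(x):.4f}  relu'={relu_deriv(x):.1f}")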
