Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:
def sigmoid_prime(z): 
    return z * (1-z)  # because σ'(x) = σ(x) (1 - σ(x))
def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T
    # forward
    A = [a]  
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 outputs vectors
    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1)  <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T) 
It works, but:
- the accuracy at the end (for my use case: MNIST digit recognition) is just ok, but not very good. It is much better (i.e. the convergence is much better) when the line (1) is replaced by: - delta = np.dot(self.weights[k].T, delta) # (2)
- the code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests: - delta = np.dot(self.weights[k].T, delta)- instead of: - delta = np.dot(self.weights[k].T, tmp)- (With the notations of this article, it is: - output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)- ) 
These 2 arguments seem to be concordant: code (2) is better than code (1).
However, the math seem to show the contrary (see video here; another detail: note that my loss function is multiplied by 1/2 whereas it's not on the video):
Question: which one is correct: the implementation (1) or (2)?
In LaTeX:
$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y) $$
$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$
I spent two days to analyze this problem, I filled a few pages of notebook with partial derivative computations... and I can confirm:
- the maths written in LaTeX in the question are correct
- the code (1) is the correct one, and it agrees with the math computations: - delta = a - y for k in [2, 1, 0]: tmp = delta * sigmoid_prime(A[k+1]) delta = np.dot(self.weights[k].T, tmp) self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
- code (2) is wrong: - delta = a - y for k in [2, 1, 0]: tmp = delta * sigmoid_prime(A[k+1]) delta = np.dot(self.weights[k].T, delta) # WRONG HERE self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)- and there a slight mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set: - output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)- should be - output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
Now the difficult part that took me days to realize:
- Apparently the code (2) has a far better convergence than code (1), that's why I mislead into thinking code (2) was correct and code (1) was wrong 
- ... But in fact that's just a coincidence because the - learning_ratewas set too low. Here is the reason: when using code (2), the parameter- deltais growing much faster (- print np.linalg.norm(delta)helps to see this) than with the code (1).
- Thus "incorrect code (2)" just compensated the "too slow learning rate" by having a bigger - deltaparameter, and it lead, in some cases, to an apparently faster convergence.
Now solved!
来源:https://stackoverflow.com/questions/53287032/multi-layer-neural-network-back-propagation-formula-using-stochastic-gradient-d
