Question
I'm attempting to implement a neural network with a single hidden layer to classify two training examples. The network uses the sigmoid activation function.
The layer dimensions and weights are as follows:
X : 2x4
w1 : 2x3
l1 : 4x3
w2 : 2x4
Y : 2x3
I'm experiencing an issue in backpropagation where the matrix dimensions do not match. This code:
import numpy as np

M = 2
learning_rate = 0.0001

X_train = np.asarray([[1,1,1,1] , [0,0,0,0]])
Y_train = np.asarray([[1,1,1] , [0,0,0]])

X_trainT = X_train.T
Y_trainT = Y_train.T

A2_sig = 0;
A1_sig = 0;

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

def forwardProp() :

    global A2_sig, A1_sig;

    w1=np.random.uniform(low=-1, high=1, size=(2, 2))
    b1=np.random.uniform(low=1, high=1, size=(2, 1))
    w1 = np.concatenate((w1 , b1) , axis=1)
    A1_dot = np.dot(X_trainT , w1)
    A1_sig = sigmoid(A1_dot).T

    w2=np.random.uniform(low=-1, high=1, size=(4, 1))
    b2=np.random.uniform(low=1, high=1, size=(4, 1))
    w2 = np.concatenate((w2 , b2) , axis=1)
    A2_dot = np.dot(A1_sig, w2)
    A2_sig = sigmoid(A2_dot)

def backProp() :

    global A2_sig;
    global A1_sig;

    error1 = np.dot((A2_sig - Y_trainT).T, A1_sig / M)
    print(A1_sig)
    print(error1)
    error2 = A1_sig.T - error1

forwardProp()
backProp()
Returns this error:
ValueError Traceback (most recent call last)
<ipython-input-605-5aa61e60051c> in <module>()
45
46 forwardProp()
---> 47 backProp()
48
49 # dw2 = np.dot((Y_trainT - A2_sig))
<ipython-input-605-5aa61e60051c> in backProp()
42 print(A1_sig)
43 print(error1)
---> 44 error2 = A1_sig.T - error1
45
46 forwardProp()
ValueError: operands could not be broadcast together with shapes (4,3) (2,4)
How do I compute the error for the previous layer?
Update:
import numpy as np

M = 2
learning_rate = 0.0001

X_train = np.asarray([[1,1,1,1] , [0,0,0,0]])
Y_train = np.asarray([[1,1,1] , [0,0,0]])

X_trainT = X_train.T
Y_trainT = Y_train.T

A2_sig = 0;
A1_sig = 0;

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

A1_sig = 0;
A2_sig = 0;

def forwardProp() :

    global A2_sig, A1_sig;

    w1=np.random.uniform(low=-1, high=1, size=(4, 2))
    b1=np.random.uniform(low=1, high=1, size=(2, 1))
    A1_dot = np.dot(X_train , w1) + b1
    A1_sig = sigmoid(A1_dot).T

    w2=np.random.uniform(low=-1, high=1, size=(2, 3))
    b2=np.random.uniform(low=1, high=1, size=(2, 1))
    A2_dot = np.dot(A1_dot , w2) + b2
    A2_sig = sigmoid(A2_dot)

    return(A2_sig)

def backProp() :

    global A2_sig;
    global A1_sig;

    error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
    error2 = error1 - A1_sig

    return(error1)

print(forwardProp())
print(backProp())
Returns this error:
ValueError Traceback (most recent call last)
<ipython-input-664-25e99255981f> in <module>()
47
48 print(forwardProp())
---> 49 print(backProp())
<ipython-input-664-25e99255981f> in backProp()
42
43 error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
---> 44 error2 = error1.T - A1_sig
45
46 return(error1)
ValueError: operands could not be broadcast together with shapes (2,3) (2,2)
Have I set the matrix dimensions incorrectly?
Answer 1:
Code review
I have examined your latest version and noticed the following mistakes:
- (minor) In the forward pass, A1_sig is never used; maybe it's just a typo.
- (major) In the backward pass, I'm not sure what you intended to use as a loss function. From the code it looks like an L2 loss:

  error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)

  The key expression is A2_sig - Y_trainT.T (though maybe I just don't get your idea). However, you mention that you're doing multi-label classification, most probably binary. In this case, L2 loss is a poor choice (see this post if you're interested why). Instead, use the logistic regression loss, a.k.a. cross-entropy. In your case, it's binary.
- (critical) In the backward pass, you've skipped the sigmoid layer. The following line takes the loss error and passes it through the linear layer:

  error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)

  ... while the forward pass goes through the sigmoid activation after the linear layer (which is correct). At this point, error1 doesn't make any sense and its dimensions don't matter; a minimal sketch of the missing sigmoid step follows right after this list.
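A minimal sketch of that missing step, assuming the row-per-example shapes from the updated question code and the L2-style error above (A1, A2 and Y stand in for A1_sig, A2_sig and Y_train; the values are illustrative):

import numpy as np

M = 2                                # number of training examples
A1 = np.random.rand(M, 2)            # hidden-layer activations, shape (2, 2)
A2 = np.random.rand(M, 3)            # output-layer sigmoid activations, shape (2, 3)
Y = np.asarray([[1, 1, 1], [0, 0, 0]], dtype=float)

dA2 = A2 - Y                         # gradient of the L2-style loss w.r.t. the sigmoid output
dZ2 = dA2 * A2 * (1 - A2)            # chain through the sigmoid: sigma'(z) = A2 * (1 - A2)
dw2 = np.dot(A1.T, dZ2) / M          # shape (2, 3), the same shape as w2 in the updated code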
Solution
I don't like your variable naming; it's very easy to get confused. So I changed it and reorganized the code a bit. Here's the converging NN:
import numpy as np
def sigmoid(z):
return 1 / (1 + np.exp(-z))
X_train = np.asarray([[1, 1, 1, 1], [0, 0, 0, 0]]).T
Y_train = np.asarray([[1, 1, 1], [0, 0, 0]]).T
hidden_size = 2
output_size = 3
learning_rate = 0.1
w1 = np.random.randn(hidden_size, 4) * 0.1
b1 = np.zeros((hidden_size, 1))
w2 = np.random.randn(output_size, hidden_size) * 0.1
b2 = np.zeros((output_size, 1))
for i in range(50):
    # forward pass
    Z1 = np.dot(w1, X_train) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(w2, A1) + b2
    A2 = sigmoid(Z2)
    cost = -np.mean(Y_train * np.log(A2) + (1 - Y_train) * np.log(1 - A2))
    print(cost)

    # backward pass
    dA2 = (A2 - Y_train) / (A2 * (1 - A2))
    dZ2 = np.multiply(dA2, A2 * (1 - A2))
    dw2 = np.dot(dZ2, A1.T)
    db2 = np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(w2.T, dZ2)
    dZ1 = np.multiply(dA1, A1 * (1 - A1))
    dw1 = np.dot(dZ1, X_train.T)
    db1 = np.sum(dZ1, axis=1, keepdims=True)

    w1 = w1 - learning_rate * dw1
    w2 = w2 - learning_rate * dw2
    b1 = b1 - learning_rate * db1
    b2 = b2 - learning_rate * db2
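As a quick sanity check after the loop above (this continues the same script and reuses its A2 and Y_train), one can threshold the final activations and compare them with the labels:

# after training: threshold the sigmoid outputs and compare with the labels
predictions = (A2 > 0.5).astype(int)
print(predictions)
print(np.mean(predictions == Y_train))   # fraction of labels predicted correctly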
Answer 2:
Your first weight matrix, w1, should be of shape (n_features, layer_1_size), so when you multiply an input X of shape (m_examples, n_features) by w1, you get an (m_examples, layer_1_size) matrix. This gets run through the activation of layer 1 and then fed into layer 2, which should have a weight matrix of shape (layer_1_size, output_size), where output_size=3 since you are doing multi-label classification for 3 classes. As you can see, the point is to convert each layer's input into a shape that fits the number of neurons in that layer, or in other words, each input to a layer must feed into every neuron in that layer.
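As a concrete illustration, here is a small sketch using the sizes from the question (2 examples, 4 features, a hidden layer of 2 units, 3 output labels); the variable names are only illustrative:

import numpy as np

m_examples, n_features = 2, 4
layer_1_size, output_size = 2, 3

X = np.random.rand(m_examples, n_features)        # (2, 4)
w1 = np.random.rand(n_features, layer_1_size)     # (4, 2)
w2 = np.random.rand(layer_1_size, output_size)    # (2, 3)

A1 = 1 / (1 + np.exp(-np.dot(X, w1)))             # (2, 2): one row per example
A2 = 1 / (1 + np.exp(-np.dot(A1, w2)))            # (2, 3): one prediction per label
print(A1.shape, A2.shape)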
I wouldn't take the transpose of your layer inputs as you have it; I would shape the weight matrices as described so you can compute np.dot(X, w1), etc.
It also looks like you are not handling your biases correctly. When we compute Z = np.dot(w1, X) + b1, b1 should be broadcast so that it is added to every column of the product of w1 and X. This will not happen if you append b1 to your weight matrix as you have it. Rather, you should add a column of ones to your input matrix and an additional row to your weight matrix, so the bias terms sit in that row of your weight matrix and the ones in your input ensure they get added everywhere. In this setup you don't need separate b1, b2 terms.
X_train = np.c_[X_train, np.ones(m_examples)]

and remember to add one more row to your weights, so w1 should have shape (n_features+1, layer_1_size).
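Here is a small sketch of that bias trick (names are illustrative): folding a bias row into the weight matrix gives the same result as adding a broadcast bias vector:

import numpy as np

m_examples, n_features, layer_1_size = 2, 4, 2
X = np.random.rand(m_examples, n_features)
w1 = np.random.rand(n_features, layer_1_size)
b1 = np.random.rand(1, layer_1_size)

X_aug = np.c_[X, np.ones(m_examples)]     # column of ones appended: (2, 5)
w1_aug = np.r_[w1, b1]                    # extra row holding the biases: (5, 2)

Z_explicit = np.dot(X, w1) + b1           # bias added by broadcasting
Z_folded = np.dot(X_aug, w1_aug)          # bias folded into the weights
print(np.allclose(Z_explicit, Z_folded))  # True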
Update for backpropagation:
The goal of backpropagation is to compute the gradient of your error function with respect to your weights and biases, and to use each result to update each weight matrix and each bias vector.
So you need dE/dw2, dE/db2, dE/dw1, and dE/db1 so you can apply the updates:
w2 <- w2 - learning_rate * dE/dw2
b2 <- b2 - learning_rate * dE/db2
w1 <- w1 - learning_rate * dE/dw1
b1 <- b1 - learning_rate * dE/db1
Since you are doing multi-label classification, you should be using a binary cross-entropy loss.
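One way to write that cost in numpy (this mirrors the cost line in the first answer; A2 are the sigmoid outputs and Y the labels, averaged over all examples and labels):

import numpy as np

# binary cross-entropy, averaged over all examples and labels
def binary_crossentropy(A2, Y):
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))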
You can compute dE/dw2 using the chain rule:

dE/dw2 = (dE/dA2) * (dA2/dZ2) * (dZ2/dw2)

I am using Z2 for your A2_dot, since the activation hasn't been applied yet, and I'm using A2 for your A2_sig.
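For a sigmoid output with binary cross-entropy, that product collapses to dE/dZ2 = A2 - Y, which is why the code further down starts from dZ2 = A2 - Y_train. A quick numeric check of this identity (values are illustrative):

import numpy as np

A2 = np.array([[0.7, 0.2, 0.9]])          # sigmoid outputs (illustrative)
Y = np.array([[1.0, 0.0, 1.0]])           # labels

dA2 = -(Y / A2) + (1 - Y) / (1 - A2)      # dE/dA2 for binary cross-entropy
dZ2 = dA2 * A2 * (1 - A2)                 # multiply by dA2/dZ2 = A2 * (1 - A2)
print(np.allclose(dZ2, A2 - Y))           # True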
See Notes on Backpropagation [pdf] for a detailed derivation for crossentropy loss with sigmoid activation. This gives a pointwise derivation, however, whereas we are looking for a vectorized implementation, so you will have to do a bit of work to figure out the correct layout for your matrices. There is also no explicit bias vector, unfortunately.
The expression you have for error1 looks correct, but I would call it dw2, and I would just use Y_train instead of taking the transpose twice:
dw2 = (1/m) * np.dot((A2 - Y_train).T , A1)
And you also need db2, which should be:
db2 = (1/m) * np.sum(A2 - Y_train, axis=1, keepdims=True)
You will have to apply the chain rule further to get dw1 and db1, and I'll leave that to you, but there is a nice derivation in Week 3 of the Neural Networks and Deep Learning Coursera course.
I can't say much about the line you are getting an error on, besides that I don't think you should have that calculation in your backprop code, so it makes sense that the dimensions don't match. You might be thinking of the gradient at the output, but I can't think of any similar expression involving A1 for backprop in this network.
This article has a very nice implementation of a one-hidden-layer neural net in numpy. It does use softmax at the output, but it has sigmoid activations in the hidden layer, and otherwise the difference in calculation is minimal. It should help you calculate dw1 and db1 for the hidden layer. Specifically, look at the expression for delta1 in the section titled "A neural network in practice".
Converting their calculation to the notation we're using, and using a sigmoid at the output instead of softmax, it should look like:
# shapes as in the first answer's code: one training example per column
dZ2 = A2 - Y_train
dZ1 = np.dot(w2.T, dZ2) * A1 * (1 - A1)   # element-wise product with the sigmoid derivative
dw2 = (1/m) * np.dot(dZ2, A1.T)
db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
dw1 = (1/m) * np.dot(dZ1, X_train.T)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
Source: https://stackoverflow.com/questions/47844093/matrix-dimensions-not-matching-in-back-propagation