Question
I found a simple neural network with the layers w1, ReLU, and w2. I tried adding a new weight layer in the middle, followed by a second ReLU, so the layers are now w1, ReLU, w_mid, ReLU, and w2.
It is much, much slower than the original three-layer network, if it works at all. I'm not sure whether the forward pass goes through every layer, or whether backprop is working across every part it is supposed to.
The neural network is from this link. It is the third block of code down the page.
This is the code I changed.
Below it is the original.
import torch
dtype = torch.float
device = torch.device("cpu")
#device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 250, 250, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w_mid = torch.randn(H, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-5
for t in range(5000):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    k = h_relu.mm(w_mid)
    k_relu = k.clamp(min=0)
    y_pred = k_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 1000 == 0:
        print(t, loss)
    # Backprop to compute gradients of w1, mid, and w2 with respect to loss
    grad_y_pred = (y_pred - y) * 2
    grad_w2 = k_relu.t().mm(grad_y_pred)
    grad_k_relu = grad_y_pred.mm(w2.t())
    grad_k = grad_k_relu.clone()
    grad_k[k < 0] = 0
    grad_mid = h_relu.t().mm(grad_k)
    grad_h_relu = grad_k.mm(w1.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w_mid -= learning_rate * grad_mid
    w2 -= learning_rate * grad_w2
The loss is:
0 1904074240.0
1000 639.4848022460938
2000 639.4848022460938
3000 639.4848022460938
4000 639.4848022460938
This is the original code from the PyTorch website.
import torch
dtype = torch.float
#device = torch.device("cpu")
device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
Answer 1:
The calculation of the gradient of h_relu is not correct:
grad_h_relu = grad_k.mm(w1.t())
That should be w_mid, not w1:
grad_h_relu = grad_k.mm(w_mid.t())
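One way to confirm that backprop reaches every layer is to compare the hand-written gradients against autograd. Note that in the modified code D_in and H are both 250, so w1.t() and w_mid.t() have the same shape and the mistake raises no error. The following is only a sketch under assumptions that are not in the question: the sizes are shrunk to keep it fast, the seed is fixed, and double precision is used so the comparison is tight.

```python
import torch

torch.manual_seed(0)   # assumption: fixed seed so the check is repeatable
dtype = torch.double   # assumption: double precision for a tight comparison

# Smaller sizes than in the question, chosen only to keep the check fast
N, D_in, H, D_out = 8, 20, 20, 5

x = torch.randn(N, D_in, dtype=dtype)
y = torch.randn(N, D_out, dtype=dtype)
w1 = torch.randn(D_in, H, dtype=dtype, requires_grad=True)
w_mid = torch.randn(H, H, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=dtype, requires_grad=True)

# Forward pass, same structure as in the question
h = x.mm(w1)
h_relu = h.clamp(min=0)
k = h_relu.mm(w_mid)
k_relu = k.clamp(min=0)
y_pred = k_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Reference gradients computed by autograd
loss.backward()

# Manual backward pass with the w_mid fix applied
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = k_relu.t().mm(grad_y_pred)
    grad_k_relu = grad_y_pred.mm(w2.t())
    grad_k = grad_k_relu.clone()
    grad_k[k < 0] = 0
    grad_mid = h_relu.t().mm(grad_k)
    grad_h_relu = grad_k.mm(w_mid.t())  # w_mid here, not w1
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

# Each flag is True when the manual gradient agrees with autograd
w1_ok = torch.allclose(grad_w1, w1.grad)
w_mid_ok = torch.allclose(grad_mid, w_mid.grad)
w2_ok = torch.allclose(grad_w2, w2.grad)
print(w1_ok, w_mid_ok, w2_ok)
```

If all three flags are True, the manual backward pass agrees with autograd. Swapping w_mid.t() back to w1.t() in this sketch should make the w1 check fail while the other two still pass, since only grad_w1 depends on that line.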
Other than that, the calculations are correct, but you should lower the learning rate. The gradients are very large at the beginning, which makes the weights very large and leads to overflowing values (infinity), which in turn produce NaN losses and gradients. This is known as exploding gradients.
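The step from overflow to NaN can be seen in isolation with a few float32 values. This is a small illustration of the arithmetic only, not of the network itself; the variable names are made up for the example.

```python
import torch

# The largest finite float32 value is roughly 3.4e38
big = torch.tensor(3.0e38, dtype=torch.float32)

overflowed = big * 10                 # exceeds the float32 range, becomes inf
difference = overflowed - overflowed  # inf - inf is undefined, becomes NaN
product = overflowed * 0              # inf * 0 is undefined, becomes NaN

print(torch.isinf(overflowed).item())  # True
print(torch.isnan(difference).item())  # True
print(torch.isnan(product).item())     # True
```

Once a single NaN appears in a weight matrix, every matrix product that touches it also yields NaN, so the loss and all later gradients become NaN as well.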
In your example a learning rate of 1e-8 seems to work.
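To try different learning rates yourself, the corrected loop can be wrapped in a function that stops as soon as the loss is no longer finite. This is a sketch, not a tuned training script: the train function, its seed parameter, and the step count are assumptions introduced for the illustration, and no particular loss values are claimed.

```python
import math
import torch


def train(learning_rate, steps, seed=0):
    """Run the corrected w1 / w_mid / w2 network and return the loss per step."""
    torch.manual_seed(seed)  # assumption: fixed seed for repeatable runs
    N, D_in, H, D_out = 64, 250, 250, 10  # sizes from the question
    x = torch.randn(N, D_in)
    y = torch.randn(N, D_out)
    w1 = torch.randn(D_in, H)
    w_mid = torch.randn(H, H)
    w2 = torch.randn(H, D_out)

    losses = []
    for t in range(steps):
        # Forward pass
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        k = h_relu.mm(w_mid)
        k_relu = k.clamp(min=0)
        y_pred = k_relu.mm(w2)

        loss = (y_pred - y).pow(2).sum().item()
        losses.append(loss)
        if not math.isfinite(loss):
            break  # inf or NaN: stop instead of updating with bad values

        # Backward pass with the w_mid fix
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = k_relu.t().mm(grad_y_pred)
        grad_k = grad_y_pred.mm(w2.t())
        grad_k[k < 0] = 0
        grad_mid = h_relu.t().mm(grad_k)
        grad_h = grad_k.mm(w_mid.t())
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Gradient descent update
        w1 -= learning_rate * grad_w1
        w_mid -= learning_rate * grad_mid
        w2 -= learning_rate * grad_w2
    return losses


losses = train(learning_rate=1e-8, steps=200)
print(len(losses), losses[0], losses[-1])
```

Calling train with a few different values of learning_rate and comparing the first and last entries of the returned list gives a quick view of which rates keep the loss finite and decreasing.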
Source: https://stackoverflow.com/questions/62272830/i-modified-a-few-layers-to-an-example-of-a-neural-network-just-to-see-if-i-could