I modified a few layers to an example of a neural network just to see if I could. What's wrong with it?

Submitted by 我怕爱的太早我们不能终老 on 2020-07-23 06:43:11

Question


A simple neural network I found had the layers w1, ReLU, and w2. I tried to add a new weight layer in the middle and a second ReLU after it, so the layers are now w1, ReLU, w_mid, ReLU, and w2.
It is much, much slower than the original three-layer network, if it works at all. I'm not sure whether the forward pass goes through every layer, or whether backprop reaches every part it is supposed to.
The neural network is from this link. It is the third block of code down the page.

This is the code I changed.
Below it is the original.

    import torch
    dtype = torch.float
    device = torch.device("cpu")
    #device = torch.device("cuda:0") # Uncomment this to run on GPU

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 250, 250, 10

    # Create random input and output data
    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)

    # Randomly initialize weights
    w1 = torch.randn(D_in, H, device=device, dtype=dtype)
    w_mid = torch.randn(H, H, device=device, dtype=dtype)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype)

    learning_rate = 1e-5
    for t in range(5000):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        k = h_relu.mm(w_mid)
        k_relu = k.clamp(min=0)
        y_pred = k_relu.mm(w2)


        # Compute and print loss
        loss = (y_pred - y).pow(2).sum().item()
        if t % 1000 == 0:
            print(t, loss)

        # Backprop to compute gradients of the loss with respect to w1, w_mid, and w2
        grad_y_pred = (y_pred - y) * 2
        grad_w2 = k_relu.t().mm(grad_y_pred)
        grad_k_relu = grad_y_pred.mm(w2.t())
        grad_k = grad_k_relu.clone()
        grad_k[k < 0] = 0
        grad_mid = h_relu.t().mm(grad_k)
        grad_h_relu = grad_k.mm(w1.t())
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Update weights
        w1 -= learning_rate * grad_w1
        w_mid -= learning_rate * grad_mid
        w2 -= learning_rate * grad_w2  

The loss is:

    0 1904074240.0
    1000 639.4848022460938
    2000 639.4848022460938
    3000 639.4848022460938
    4000 639.4848022460938

This is the original code from the PyTorch website.

    import torch


    dtype = torch.float
    device = torch.device("cpu")
    #device = torch.device("cuda:0") # Uncomment this to run on GPU

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Create random input and output data
    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)

    # Randomly initialize weights
    w1 = torch.randn(D_in, H, device=device, dtype=dtype)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype)

    learning_rate = 1e-6
    for t in range(500):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        y_pred = h_relu.mm(w2)

        # Compute and print loss
        loss = (y_pred - y).pow(2).sum().item()
        if t % 100 == 99:
            print(t, loss)

        # Backprop to compute gradients of the loss with respect to w1 and w2
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h_relu.t().mm(grad_y_pred)
        grad_h_relu = grad_y_pred.mm(w2.t())
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Update weights using gradient descent
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2

Answer 1:


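The calculation of the gradient of h_relu is not correct:

    grad_h_relu = grad_k.mm(w1.t())

It should use w_mid, not w1. The mistake raises no shape error only because D_in and H are both 250 in your code, so w1 and w_mid have the same shape:

    grad_h_relu = grad_k.mm(w_mid.t())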
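If you want to confirm that the backward pass now reaches every layer, one option is to compare the hand-written gradients with the ones PyTorch's autograd computes for the same forward pass. The following is only a sketch under my own assumptions: the small sizes, the fixed seed, the use of double precision, and the `*_ok` variable names are illustrative choices and not part of your code.

```python
import torch

torch.manual_seed(0)
dtype = torch.double  # double precision keeps the comparison tight

# Small illustrative sizes (assumption); D_in == H is kept, as in the question
N, D_in, H, D_out = 8, 5, 5, 3
x = torch.randn(N, D_in, dtype=dtype)
y = torch.randn(N, D_out, dtype=dtype)
w1 = torch.randn(D_in, H, dtype=dtype, requires_grad=True)
w_mid = torch.randn(H, H, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=dtype, requires_grad=True)

# Forward pass, same structure as in the question
h = x.mm(w1)
h_relu = h.clamp(min=0)
k = h_relu.mm(w_mid)
k_relu = k.clamp(min=0)
y_pred = k_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Reference gradients from autograd
loss.backward()

with torch.no_grad():
    # Manual backward pass with the corrected line
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = k_relu.t().mm(grad_y_pred)
    grad_k = grad_y_pred.mm(w2.t())
    grad_k[k < 0] = 0
    grad_mid = h_relu.t().mm(grad_k)
    grad_h = grad_k.mm(w_mid.t())  # w_mid here, not w1
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # The variant from the question, kept only for comparison
    bad_grad_h = grad_k.mm(w1.t())
    bad_grad_h[h < 0] = 0
    bad_grad_w1 = x.t().mm(bad_grad_h)

w1_ok = torch.allclose(grad_w1, w1.grad)
w_mid_ok = torch.allclose(grad_mid, w_mid.grad)
w2_ok = torch.allclose(grad_w2, w2.grad)
w1_buggy_ok = torch.allclose(bad_grad_w1, w1.grad)
print(w1_ok, w_mid_ok, w2_ok, w1_buggy_ok)
```

The three corrected gradients should agree with autograd, while the w1 gradient computed through `w1.t()` should not.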

Other than that, the calculations are correct, but you should lower the learning rate. The gradients are very large at the beginning, which makes the weights very large. That leads to overflowing values (infinity), which in turn produce NaN losses and gradients. This is known as exploding gradients.
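A related observation, offered as a guess and not something I have traced through your run: the plateau at about 639.48 in your output is close to what the loss would be if y_pred had collapsed to all zeros. That can happen when an oversized first update leaves every second-layer ReLU inactive. The sketch below assumes exactly that situation by setting k_relu to zeros by hand; it shows that the loss then equals the sum of squared targets (about N * D_out = 640 in expectation for standard normal y) and that no gradient reaches w2.

```python
import torch

torch.manual_seed(0)
N, H, D_out = 64, 250, 10
y = torch.randn(N, D_out)
w2 = torch.randn(H, D_out)

# Assumption for illustration: every second-layer ReLU is inactive
k_relu = torch.zeros(N, H)
y_pred = k_relu.mm(w2)

# The loss reduces to the sum of squared targets
loss = (y_pred - y).pow(2).sum().item()
target_energy = y.pow(2).sum().item()

# The gradient for w2 is exactly zero, so the update cannot move it
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = k_relu.t().mm(grad_y_pred)
max_abs_grad_w2 = grad_w2.abs().max().item()

print(loss, target_energy, max_abs_grad_w2)
```

With k_relu at zero, masking grad_k where k < 0 also zeroes the gradients for w_mid and w1, which would explain a loss that stays the same from step 1000 onward.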

In your example a learning rate of 1e-8 seems to work.
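To try different learning rates without guessing, it can help to wrap the corrected loop in a function that stops as soon as the loss is no longer finite. This is a sketch, not a tuned recipe: the `train` helper, its `seed` parameter, and the step count are names and values I made up for illustration, and whether a given rate converges will depend on the random initialization.

```python
import math
import torch

def train(learning_rate, steps, seed=0):
    """Run the corrected w1-ReLU-w_mid-ReLU-w2 loop and return the losses seen."""
    torch.manual_seed(seed)
    N, D_in, H, D_out = 64, 250, 250, 10
    x = torch.randn(N, D_in)
    y = torch.randn(N, D_out)
    w1 = torch.randn(D_in, H)
    w_mid = torch.randn(H, H)
    w2 = torch.randn(H, D_out)

    losses = []
    for t in range(steps):
        # Forward pass
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        k = h_relu.mm(w_mid)
        k_relu = k.clamp(min=0)
        y_pred = k_relu.mm(w2)

        loss = (y_pred - y).pow(2).sum().item()
        losses.append(loss)
        if not math.isfinite(loss):
            break  # stop once the loss has become inf or NaN

        # Backward pass with the corrected w_mid line
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = k_relu.t().mm(grad_y_pred)
        grad_k = grad_y_pred.mm(w2.t())
        grad_k[k < 0] = 0
        grad_mid = h_relu.t().mm(grad_k)
        grad_h = grad_k.mm(w_mid.t())
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Gradient descent update
        w1 -= learning_rate * grad_w1
        w_mid -= learning_rate * grad_mid
        w2 -= learning_rate * grad_w2
    return losses

losses = train(learning_rate=1e-8, steps=200)
print(len(losses), losses[0], losses[-1])
```

Calling `train` with a few candidate rates and comparing the first and last entries of the returned list gives a quick picture of which rates diverge, stall, or make progress.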



Source: https://stackoverflow.com/questions/62272830/i-modified-a-few-layers-to-an-example-of-a-neural-network-just-to-see-if-i-could
