Why does PyTorch training on CUDA run much slower than on the CPU?

Submitted by 我与影子孤独终老i on 2021-01-29 09:49:03

Question


I guess I have made a mistake somewhere in the following simple neural network in PyTorch, because it runs much slower with CUDA than on the CPU. Can you please find the mistake? Using a function like

    def backward(ctx, input):

        return backward_sigm(ctx, input)

seems to have no real impact on performance.

import torch
import torch.nn as nn
import torch.nn.functional as f


dname = 'cuda:0'
dname = 'cpu'   # the second assignment wins; comment this line out to run on the GPU




device = torch.device(dname)


print(torch.version.cuda)

def forward_sigm(ctx, input):

    sigm = 1 / (1 + torch.exp(-input))

    ctx.save_for_backward(sigm)

    return sigm

def forward_step(ctx, input):

    return (input > 0.5).to(dtype=torch.float32)   # cast the boolean mask to float without re-wrapping it in torch.tensor


def backward_sigm(ctx, grad_output):

    sigm, = ctx.saved_tensors

    return grad_output * sigm * (1-sigm)


def backward_step(ctx, grad_output):

    return grad_output




class StepAF(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        return forward_sigm(ctx, input)


    @staticmethod
    def backward(ctx, input):

        return backward_sigm(ctx, input)
    #else return grad_output



class StepNN(torch.nn.Module):

    def __init__(self, input_size, hidden_size, output_size):
        super(StepNN, self).__init__()
        self.linear1 = torch.nn.Linear(input_size, hidden_size)
        #self.linear1.cuda()
        self.linear2 = torch.nn.Linear(hidden_size, output_size)
        #self.linear2.cuda()

        #self.StepAF = StepAF.apply



    def forward(self,x):

        h_line_1 = self.linear1(x)

        h_thrash_1 = StepAF.apply(h_line_1)

        h_line_2 = self.linear2(h_thrash_1)

        output = StepAF.apply(h_line_2)

        return output


inputs = torch.tensor( [[1,0,1,0],[1,0,0,1],[0,1,0,1],[0,1,1,0],[1,0,0,0],[0,0,0,1],[1,1,0,1],[0,1,0,0],], dtype = torch.float32, device = device)

expected = torch.tensor( [[1,0,0],[1,0,0],[0,1,0],[0,1,0],[1,0,0],[0,0,1],[0,1,0],[0,0,1],], dtype = torch.float32, device = device)


nn = StepNN(4, 8, 3).to(device)   # move the model's parameters to the selected device


#print(*(x for x in nn.parameters()))

criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(nn.parameters(), lr=1e-3)

steps = 50000

print_steps = steps // 20

good_loss = 1e-5

for t in range(steps):

    output = nn(inputs)
    loss = criterion(output, expected)



    if t % print_steps == 0:
        print('step ',t, ', loss :' , loss.item())

    if loss < good_loss:
        print('step ',t, ', loss :' , loss.item())
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()



test = torch.tensor( [[0,1,0,1],[0,1,1,0],[1,0,1,0],[1,1,0,1],], dtype = torch.float32, device=device)


print(nn(test))

Answer 1:


Unless you have a large enough amount of data, you won't see any performance improvement from using a GPU. The problem is that a GPU gets its speed from parallel processing; with only a handful of samples, the CPU can process them about as fast as the GPU can, and the GPU additionally pays the overhead of launching kernels and moving data between devices, which is why it can end up slower.

As far as I can see in your example, you are using 8 samples with 4 features each, and the hidden layer has only 8 units, so the CPU can perform these calculations very quickly. You would probably only see a GPU advantage once you have hundreds or thousands of samples per batch.
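To see this effect yourself, you can time a single training step on both devices at different batch sizes. Below is a minimal sketch of such a measurement (the layer sizes mirror your network, but the large batch size of 8192 is an arbitrary choice for illustration); the torch.cuda.synchronize() calls matter because CUDA operations are asynchronous, so without them the timer would stop before the GPU has actually finished its work.

import time
import torch
import torch.nn as nn

def time_step(device, batch_size, in_features=4, hidden=8, out_features=3, repeats=100):
    # small linear + sigmoid stack, roughly the shape of the network in the question
    model = nn.Sequential(nn.Linear(in_features, hidden),
                          nn.Sigmoid(),
                          nn.Linear(hidden, out_features)).to(device)
    x = torch.randn(batch_size, in_features, device=device)
    y = torch.randn(batch_size, out_features, device=device)
    criterion = nn.MSELoss(reduction='sum')

    # warm-up so one-time CUDA setup costs are not measured
    for _ in range(10):
        criterion(model(x), y).backward()
    if device.type == 'cuda':
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(repeats):
        model.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
    if device.type == 'cuda':
        torch.cuda.synchronize()   # wait for queued GPU work before stopping the clock
    return (time.time() - start) / repeats

for batch_size in (8, 8192):
    for dev in ('cpu', 'cuda:0'):
        if dev.startswith('cuda') and not torch.cuda.is_available():
            continue
        print(dev, 'batch', batch_size, ':', time_step(torch.device(dev), batch_size), 'sec/step')

With the tiny batch of 8 you should expect the CPU to be at least as fast as the GPU; only at the larger batch size does the parallelism of the GPU start to pay off.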

There are lots of example notebooks online that use the MNIST data set (it has around 60,000 training images), so you could load one in Google Colab and train first on the CPU and then on the GPU to compare the training times. You could try this link for example. It uses TensorFlow instead of PyTorch, but it will give you an idea of the performance improvement a GPU brings.

Note: if you haven't used Google Colab before, you need to change the runtime type (None for CPU, GPU for GPU) in the Runtime menu at the top.

I will also post the results from that notebook here (look at the times in brackets; if you run it yourself, you can see firsthand how much faster the GPU is):

On CPU :

INFO:tensorflow:loss = 294.3736, step = 1
INFO:tensorflow:loss = 28.285727, step = 101 (23.769 sec)
INFO:tensorflow:loss = 23.518856, step = 201 (24.128 sec)

On GPU :

INFO:tensorflow:loss = 295.08328, step = 0
INFO:tensorflow:loss = 47.37291, step = 100 (4.709 sec)
INFO:tensorflow:loss = 23.31364, step = 200 (4.581 sec)
INFO:tensorflow:loss = 9.980572, step = 300 (4.572 sec)
INFO:tensorflow:loss = 17.769928, step = 400 (4.560 sec)
INFO:tensorflow:loss = 16.345463, step = 500 (4.531 sec)


Source: https://stackoverflow.com/questions/56509469/why-pytorch-training-on-cuda-works-much-slower-than-in-cpu
