Negative Speed Gain Using Numba Vectorize target='cuda'

纵然是瞬间 submitted on 2019-12-04 17:33:31

This question interests me as well. I tried your code and got similar results. To investigate, I wrote a CUDA kernel using cuda.jit and added it to your code:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

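N = 16*50000 #32000000
blockdim = 16, 1                      # threads per block
griddim = N // blockdim[0], 1         # blocks per grid (N is a multiple of 16)

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)                  # global index of this thread
    if i < N:                         # skip threads past the end of the arrays
        a[i] += b[i]                  # in-place add: the result is stored in a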

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a,b):
    return a + b


A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim,blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
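A side note on the launch configuration: the grid size above is exact only because N is a multiple of the block size, and a block of 16 threads fills only half of a 32-thread warp. A common pattern is ceiling division combined with the `i < N` guard the kernel already has. The helper name below is made up for illustration, and 256 threads per block is an assumed, commonly used value rather than a tuned one.

```python
def blocks_per_grid(n, threads_per_block):
    """Smallest number of blocks whose threads cover all n elements."""
    return (n + threads_per_block - 1) // threads_per_block


N = 16 * 50000

grid_16 = blocks_per_grid(N, 16)            # same grid as in the code above
grid_256 = blocks_per_grid(N, 256)          # fewer, larger blocks
grid_uneven = blocks_per_grid(N + 1, 256)   # one extra, partly idle block

print(grid_16, grid_256, grid_uneven)
```

Under these assumptions the kernel could be launched as `VectorAdd_GPU[blocks_per_grid(N, 256), 256](d_A, d_B)`; whether that is faster on a given GPU has to be measured.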

In this 'benchmark' I also count the time spent copying the arrays from host to device and from device to host. In this case the GPU function is slower than the CPU one.

For the case above (times in seconds):

CPU - 0.0033
GPU - 0.0096
Vectorize (target='cuda') - 0.15 (on my PC)

If the copying time is not counted:

GPU - 0.000245
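To put these numbers side by side, the arithmetic below uses only the timings reported above (in seconds) and attributes the difference between the two GPU timings to host/device copies. That attribution is an assumption: kernel launches are asynchronous, and the post does not say whether the kernel-only timing included a `cuda.synchronize()` call.

```python
# Timings reported above, in seconds (N = 800,000 float32 elements).
cpu_s = 0.0033            # @vectorize, target='cpu'
gpu_total_s = 0.0096      # cuda.jit kernel, copies included
gpu_kernel_s = 0.000245   # cuda.jit kernel, copies excluded
vec_cuda_s = 0.15         # @vectorize, target='cuda'

# Assumption: everything that is not kernel time is copy overhead.
copy_s = gpu_total_s - gpu_kernel_s
copy_share = copy_s / gpu_total_s

kernel_speedup = cpu_s / gpu_kernel_s        # kernel alone vs. CPU
end_to_end_slowdown = gpu_total_s / cpu_s    # kernel plus copies vs. CPU
vectorize_slowdown = vec_cuda_s / cpu_s      # target='cuda' vs. target='cpu'

print(f"copy overhead: {copy_s:.6f} s ({copy_share:.1%} of the GPU run)")
print(f"kernel alone is {kernel_speedup:.1f}x faster than the CPU")
print(f"kernel plus copies is {end_to_end_slowdown:.1f}x slower than the CPU")
print(f"vectorize(target='cuda') is {vectorize_slowdown:.1f}x slower than the CPU")
```

Read this way, roughly 97% of the GPU run goes to copying: the kernel itself is about 13 times faster than the CPU version, yet the full round trip is about 3 times slower.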

So, what I have learned: (1) Copying from host to device and from device to host is time-consuming. This is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down calculations on the GPU. (3) It is better to use hand-written kernels (and, of course, to minimize memory copying).
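Point (3) can be made concrete with a back-of-the-envelope model. The sketch below assumes that every step costs the same as the vector add measured above, that costs add up linearly, and that the copy overhead equals the difference between the two GPU timings. The function names are illustrative and not part of Numba.

```python
import math

CPU_STEP_S = 0.0033                   # one step on the CPU (reported above)
GPU_KERNEL_S = 0.000245               # one kernel launch, data already on the device
GPU_COPY_S = 0.0096 - GPU_KERNEL_S    # assumed cost of one round trip of copies


def cpu_time(steps):
    """Total CPU time for the given number of steps."""
    return steps * CPU_STEP_S


def gpu_time_copy_every_step(steps):
    """Total GPU time if the arrays are copied to and from the device on every step."""
    return steps * (GPU_COPY_S + GPU_KERNEL_S)


def gpu_time_resident(steps):
    """Total GPU time if the arrays are copied once and then stay on the device."""
    return GPU_COPY_S + steps * GPU_KERNEL_S


def break_even_steps():
    """Smallest step count at which the resident GPU version beats the CPU."""
    return math.floor(GPU_COPY_S / (CPU_STEP_S - GPU_KERNEL_S)) + 1


for steps in (1, 4, 100):
    print(steps, cpu_time(steps), gpu_time_copy_every_step(steps), gpu_time_resident(steps))
print("break-even at", break_even_steps(), "steps")
```

With these numbers the resident version overtakes the CPU from the fourth step on, while the copy-every-step version never does. Real workloads will not scale this cleanly, so treat the model as an estimate only.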

By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme. In that case the execution time of the Python program is comparable to that of a C program, and it gives about a 100-fold speedup. This is because, fortunately, in this case you can run many iterations without any data exchange between host and device.
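The post does not include the heat-conduction code. To illustrate why such a solver needs no host/device exchange between iterations, here is a minimal CPU-only sketch of an explicit (FTCS) scheme for the 1D heat equation u_t = alpha * u_xx with fixed boundary values. The dimensionality, grid size, alpha, and step sizes are all assumptions chosen for the example. On a GPU, each call to `ftcs_step` would correspond to one kernel launch on arrays that stay on the device.

```python
def ftcs_step(u, r):
    """One explicit step: u_new[i] = u[i] + r * (u[i-1] - 2*u[i] + u[i+1])."""
    new = u[:]                           # boundary points keep their fixed values
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def solve_heat(n_points=51, n_steps=5000, alpha=1.0, length=1.0):
    """Integrate a rod with a hot left end (1.0) and a cold right end (0.0)."""
    dx = length / (n_points - 1)
    dt = 0.4 * dx * dx / alpha           # gives r = 0.4, inside the stability limit r <= 0.5
    r = alpha * dt / (dx * dx)
    u = [0.0] * n_points
    u[0] = 1.0
    for _ in range(n_steps):             # each step needs only the previous state
        u = ftcs_step(u, r)
    return u


u = solve_heat()
print(u[0], u[len(u) // 2], u[-1])
```

After enough steps the profile approaches the straight line between the two boundary values. The input is set up once and the result is read once, which is the pattern that keeps copy overhead negligible.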

UPD. Software and hardware used: Win7 64-bit, CPU: Intel Core2 Quad 3 GHz, GPU: NVIDIA GeForce GTX 580.
