Numba and guvectorize for CUDA target: Code running slower than expected

问题

Notable details

Large datasets (10 million x 5), (200 x 10 million x 5)
Numpy mostly
Takes longer after every run
Using Spyder3
Windows 10

First thing is attempting to use guvectorize with the following function. I am passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error being:

File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", >line 82, in jitwrapper debug=debug)

TypeError: init() got an unexpected keyword argument 'debug'

After following all that I could find from this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing but oh well. It should also be said that this error only occurs after running it once and having it crash due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], 
int64, int64, float64[:,:,:])'], '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', 
target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Attempting to run the code with nvprofiler in command line results in the following error:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this ?multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device id), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try and limit to other one card, but am left with the same results.

I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.

Why is cuda so much slower, and what do these errors mean?

Thanks for the help!

EDIT: Here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I am able to run the code in cuda using a gtx 1080ti, however, it is much slower than running it in parallel or cpu. I've looked at other posts pertaining to guvectorize, yet none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code 'cuda friendly', or is only doing multiplication across arrays too simple for any benefit to be seen?

回答1:

First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host.

It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that might run faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is far too simple.

Secondly, it's important, I believe, to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will a worker do?" and then allow numba to spin up multiple workers from that. The way that you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data, but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here also.

So a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far to small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect in fact that 64 is actually just a numba minimum threadblock size, and in fact only 1 thread in that threadblock is doing any useful calculation work. That one thread is executing your entire ufunc, including all for-loops, in a serial fashion.

That's obviously not what anyone would intend for rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a 3rd array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount:

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a runtime of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel duration in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (which takes about 20 millisconds total) and we note that the kernel dimensions are now 15625 blocks of 128 threads each for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) in action with your original code.

Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it is dimensionally matching the other two arrays, and be done with it.

But the question remains: how to handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give the possibility for multiple workers to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise if we slice along the first, or countRow dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see below for a worked example of how you might approach the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on sizes of arrays we pass). The code could look something like this:

$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

The code changes were related to removing the countCol parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified our 3-dimensional functions in the signature to two-dimensional. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on a slice.
The kernel dimensions as reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization, there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization, there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.
The kernel performance reported by the profiler of 47.4ms is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (at ~0.001s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are at here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms, or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.

Some additional comments on the python timing: I believe the exact behavior of how python is compiling the necessary kernels for the original, vectorize, and guvectorize improved versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the python timing is consistent, trend-wise with the overall wall time and the kernel-only timing:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"

We can be guided by what worked to slice along the first dimension. One possible approach would be to rearrange the shape of the arrays so that the 4000 dimension was the first dimension, then remove that, similar to what we did in the previous treatment of guvectorize. Here's a worked example:

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Somewhat predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we slice into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".

Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.

As a final observation, let's revisit the statement I made earlier (and is similar to statements made in the comment and in the other answer):

"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."

We now have convenient test cases in either t16.py or t17.py that we could work with to test this. For simplicity I'll choose t16.py. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc:

$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization, and copy of data to/from GPU. The overall walltime is also our best measurement, at about 0.5s compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.

回答2:

The reason the gufunc Numba emits and runs is so slow becomes immediately obvious on profiling (numba 0.38.1 with CUDA 8.0)

==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

The resulting kernel launch which runs the code is using a single block of 64 threads. On a GPU which can theoretically have up to 2048 threads per MP, and 23 MP, that means about 99.9% of the theoretical processing capacity of your GPU is not being used. This looks like a ridiculous design choice by the numba developers and I would be reporting it as a bug if you are being impeded by it (and it seems you are).

The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of execution parameters. That way you can at least ensure that the code will be run with enough threads to potential use all the capacity of your hardware. It is still a very memory bound operation, so what you can achieve in speed up might be restricted to considerably less than the ratio of memory bandwidth of your GPU to your CPU. And that might well not be enough to amortize the cost of the host to device memory transfers, so there might be no performance gains in the best possible case, even though this is far from that.

In short, beware the perils of automagic compiler generated parallelism....

Postscript to add that I managed to work out how to get the PTX of the code emitted by numba, and suffice to say it is absolutely craptulacular (and so long I can't actually post all of it):

{
    .reg .pred  %p<9>;
    .reg .b32   %r<8>;
    .reg .f64   %fd<4>;
    .reg .b64   %rd<137>;


    ld.param.u64    %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
    ld.param.u64    %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
    ld.param.u64    %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    ld.param.u64    %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
    ld.param.u64    %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
    ld.param.u64    %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
    ld.param.u64    %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
    ld.param.u64    %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
    ld.param.u64    %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
    ld.param.u64    %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
    ld.param.u64    %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
    ld.param.u64    %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
    ld.param.u64    %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
    ld.param.u64    %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
    ld.param.u64    %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
    ld.param.u64    %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
    ld.param.u64    %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
    ld.param.u64    %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
    ld.param.u64    %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
    ld.param.u64    %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
    ld.param.u64    %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
    ld.param.u64    %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
    ld.param.u64    %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
    ld.param.u64    %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
    ld.param.u64    %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
    ld.param.u64    %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
    ld.param.u64    %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
    mov.u32     %r1, %tid.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r2, %ntid.x;
    mad.lo.s32  %r4, %r3, %r2, %r1;
    min.s64     %rd62, %rd32, %rd29;
    min.s64     %rd63, %rd39, %rd62;
    min.s64     %rd64, %rd48, %rd63;
    min.s64     %rd65, %rd51, %rd64;
    min.s64     %rd66, %rd54, %rd65;
    cvt.s64.s32 %rd1, %r4;
    setp.le.s64 %p2, %rd66, %rd1;
    @%p2 bra    BB0_8;

    ld.param.u64    %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
    ld.param.u64    %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
    ld.param.u64    %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
    ld.param.u64    %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
    ld.param.u64    %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
    ld.param.u64    %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
    cvt.u32.u64 %r5, %rd1;
    setp.lt.s32 %p1, %r5, 0;
    selp.b64    %rd67, %rd29, 0, %p1;
    add.s64     %rd68, %rd67, %rd1;
    mul.lo.s64  %rd69, %rd68, %rd121;
    add.s64     %rd70, %rd69, %rd122;
    selp.b64    %rd71, %rd48, 0, %p1;
    add.s64     %rd72, %rd71, %rd1;
    mul.lo.s64  %rd73, %rd72, %rd123;
    add.s64     %rd74, %rd73, %rd124;
    ld.u64  %rd2, [%rd74];
    selp.b64    %rd75, %rd51, 0, %p1;
    add.s64     %rd76, %rd75, %rd1;
    mul.lo.s64  %rd77, %rd76, %rd125;
    add.s64     %rd78, %rd77, %rd126;
    ld.u64  %rd3, [%rd78];
    ld.u64  %rd4, [%rd70];
    setp.lt.s64 %p3, %rd4, 1;
    @%p3 bra    BB0_8;

    ld.param.u64    %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
    ld.param.u64    %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    selp.b64    %rd80, %rd127, 0, %p1;
    mov.u64     %rd79, 0;
    min.s64     %rd81, %rd128, %rd79;
    min.s64     %rd82, %rd34, %rd79;
    selp.b64    %rd83, %rd39, 0, %p1;
    min.s64     %rd84, %rd40, %rd79;
    min.s64     %rd85, %rd41, %rd79;
    min.s64     %rd86, %rd42, %rd79;
    selp.b64    %rd87, %rd54, 0, %p1;
    min.s64     %rd88, %rd55, %rd79;
    min.s64     %rd89, %rd56, %rd79;
    min.s64     %rd90, %rd57, %rd79;
    mul.lo.s64  %rd91, %rd90, %rd61;
    add.s64     %rd92, %rd53, %rd91;
    mul.lo.s64  %rd93, %rd89, %rd60;
    add.s64     %rd94, %rd92, %rd93;
    mul.lo.s64  %rd95, %rd88, %rd59;
    add.s64     %rd96, %rd94, %rd95;
    add.s64     %rd98, %rd87, %rd1;
    mul.lo.s64  %rd99, %rd58, %rd98;
    add.s64     %rd5, %rd96, %rd99;
    mul.lo.s64  %rd100, %rd86, %rd46;
    add.s64     %rd101, %rd38, %rd100;
    mul.lo.s64  %rd102, %rd85, %rd45;
    add.s64     %rd103, %rd101, %rd102;
    mul.lo.s64  %rd104, %rd84, %rd44;
    add.s64     %rd105, %rd103, %rd104;
    add.s64     %rd106, %rd83, %rd1;
    mul.lo.s64  %rd107, %rd43, %rd106;
    add.s64     %rd6, %rd105, %rd107;
    mul.lo.s64  %rd108, %rd82, %rd37;
    add.s64     %rd109, %rd31, %rd108;
    mul.lo.s64  %rd110, %rd81, %rd36;
    add.s64     %rd111, %rd109, %rd110;
    add.s64     %rd112, %rd80, %rd1;
    mul.lo.s64  %rd113, %rd35, %rd112;
    add.s64     %rd7, %rd111, %rd113;
    add.s64     %rd8, %rd2, 1;
    mov.u64     %rd131, %rd79;

BB0_3:
    mul.lo.s64  %rd115, %rd59, %rd131;
    add.s64     %rd10, %rd5, %rd115;
    mul.lo.s64  %rd116, %rd44, %rd131;
    add.s64     %rd11, %rd6, %rd116;
    setp.lt.s64 %p4, %rd3, 1;
    mov.u64     %rd130, %rd79;
    mov.u64     %rd132, %rd3;
    @%p4 bra    BB0_7;

BB0_4:
    mov.u64     %rd13, %rd132;
    mov.u64     %rd12, %rd130;
    mul.lo.s64  %rd117, %rd60, %rd12;
    add.s64     %rd136, %rd10, %rd117;
    mul.lo.s64  %rd118, %rd45, %rd12;
    add.s64     %rd135, %rd11, %rd118;
    mul.lo.s64  %rd119, %rd36, %rd12;
    add.s64     %rd134, %rd7, %rd119;
    setp.lt.s64 %p5, %rd2, 1;
    mov.u64     %rd133, %rd8;
    @%p5 bra    BB0_6;

BB0_5:
    mov.u64     %rd17, %rd133;
    ld.f64  %fd1, [%rd135];
    ld.f64  %fd2, [%rd134];
    mul.f64     %fd3, %fd2, %fd1;
    st.f64  [%rd136], %fd3;
    add.s64     %rd136, %rd136, %rd61;
    add.s64     %rd135, %rd135, %rd46;
    add.s64     %rd134, %rd134, %rd37;
    add.s64     %rd24, %rd17, -1;
    setp.gt.s64 %p6, %rd24, 1;
    mov.u64     %rd133, %rd24;
    @%p6 bra    BB0_5;

BB0_6:
    add.s64     %rd25, %rd13, -1;
    add.s64     %rd26, %rd12, 1;
    setp.gt.s64 %p7, %rd13, 1;
    mov.u64     %rd130, %rd26;
    mov.u64     %rd132, %rd25;
    @%p7 bra    BB0_4;

BB0_7:
    sub.s64     %rd120, %rd4, %rd131;
    add.s64     %rd131, %rd131, 1;
    setp.gt.s64 %p8, %rd120, 1;
    @%p8 bra    BB0_3;

BB0_8:
    ret;
}

All of those integer operations to perform exactly one double precision multiplication!

来源：https://stackoverflow.com/questions/52046102/numba-and-guvectorize-for-cuda-target-code-running-slower-than-expected

标签

python

performance

cuda

numba

nvprof