Question
I'm trying to parallelize a simple update loop of a simulation on the GPU. Basically there are a bunch of "creatures" represented by circles; in each update loop they move, and then there is a check of whether any of them intersect. radii holds the radius for each of the different types of creatures.
import numpy as np
import math
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32, uint32)')
def update(p_x, p_y, radii, types, velocities, max_velocities, acceleration, num_creatures, cycles):
    for c in range(cycles):
        for i in range(num_creatures):
            velocities[i] = velocities[i] + acceleration
            if velocities[i] > max_velocities[i]:
                velocities[i] = max_velocities[i]
            p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
            p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for i in range(num_creatures):
            for j in range(i, num_creatures):
                delta_x = p_x[j] - p_x[i]
                delta_y = p_y[j] - p_y[i]
                distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
                sum_of_radii = radii[types[i]] + radii[types[i]]
                if distance_squared < sum_of_radii * sum_of_radii:
                    pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = 16
update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities,
                                       acceleration, num_creatures, cycles)
print(device_p_x.copy_to_host())
The 1.0 in math.cos and math.sin is just a placeholder for the directions of the individual creatures.
As it is now, there are multiple threads, but they all execute the same code. What changes do I have to make to the kernel to parallelize it?
Answer 1:
The most obvious dimension for parallelization to me seems to be the loop in i in your kernel, i.e. the one iterating over num_creatures. So I'll describe how to do that.

1. Our goal will be to remove the loop on num_creatures, and instead let each iteration of the loop be handled by a separate CUDA thread. This is possible because the work done in each loop iteration is (mostly) independent - it does not depend on the results of other loop iterations (but see 2 below).

2. A challenge we will run into is that the 2nd for-loop over num_creatures presumably depends on the results of the first. If we leave everything as serial code running in a single thread, then that dependency is taken care of by the nature of serial code execution. However, we want to parallelize this. Therefore we need a global sync between the first for-loop over num_creatures and the 2nd. A simple, convenient global sync in CUDA is the kernel launch, so we'll break the kernel code into two kernel functions. We'll call these update1 and update2.

3. This then presents the challenge of what to do about the over-arching loop in cycles. We can't simply replicate that loop in both kernels, because that would change the functional behavior - we would compute cycles updates to p_x before computing a single calculation of delta_x, for example. That is presumably not what is wanted. So, for simplicity, we'll hoist this loop out of the kernel code and back into host code. The host code will then call the update1 and update2 kernels for cycles iterations.

4. We also want to make the kernel processing adaptable to different sizes of num_creatures. So we'll pick a hard-coded size for threadsperblock, but we will make the number of blocks launched variable, based on the size of num_creatures. To facilitate this, we need a thread-check (an initial if-statement) in each of our kernels, so that "extra" threads don't do anything.
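Concretely, the thread-index plus thread-check idiom from point 4 looks like this in Numba CUDA (a minimal standalone sketch; per_creature_kernel and data are just illustrative names, the real kernels follow below):

from numba import cuda

@cuda.jit
def per_creature_kernel(data, num_creatures):
    i = cuda.grid(1)        # global 1D thread index: blockIdx.x * blockDim.x + threadIdx.x
    if i < num_creatures:   # guard: "extra" threads in the last block do nothing
        data[i] += 1.0      # per-creature work goes here, one creature per thread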
With that description, we end up with something like this:
$ cat t11.py
import numpy as np
import math
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, uint32)')
def update1(p_x, p_y, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])

@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], uint32)')
def update2(p_x, p_y, radii, types, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for i in range(cycles):
    update1[blockspergrid, threadsperblock](device_p_x, device_p_y, device_velocities, device_max_velocities, acceleration, num_creatures)
    update2[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, num_creatures)
print(device_p_x.copy_to_host())
$ python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
$
which produces the same output as the original posted version. (The original code is running 16 blocks of 64 threads all doing exactly the same thing, and stepping on each other as they write to the same data, so in the comparisons that follow I'm referring to the original posted version but running one block of one thread.)
Note that there are certainly other ways to parallelize, and probably other optimizations possible, but this should give you a sensible starting point for CUDA work.
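One such alternative, sketched here as my own illustration (not code from the answer): the pair check in update2 could itself be spread over a 2D grid, one thread per (i, j) pair, instead of each thread looping over j:

from numba import cuda

@cuda.jit
def update2_pairs(p_x, p_y, radii, types, num_creatures):
    i, j = cuda.grid(2)                          # 2D global thread indices
    if i < num_creatures and j < num_creatures and i < j:
        delta_x = p_x[j] - p_x[i]
        delta_y = p_y[j] - p_y[i]
        distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
        # note: this uses types[j] for the second radius; see the remark on radii below
        sum_of_radii = radii[types[i]] + radii[types[j]]
        if distance_squared < sum_of_radii * sum_of_radii:
            pass                                 # collision handling would go here

# launched with a 2D configuration, e.g.:
# blocks2d = ((num_creatures // 16) + 1, (num_creatures // 16) + 1)
# update2_pairs[blocks2d, (16, 16)](device_p_x, device_p_y, device_radii, device_types, num_creatures)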
As mentioned to you in your previous question, the second kernel here really doesn't do anything useful, but I assume that is just a placeholder for future work. I'm also not sure you'll get what you want with your usage of radii here, but that's also not the focus of this question.
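For what it's worth, if the intent of sum_of_radii is the sum of the two circles' radii in the pair, then presumably the second term should index creature j rather than i (my guess at the intent, not what the posted code does):

sum_of_radii = radii[types[i]] + radii[types[j]]  # presumed intent: one radius per creature in the pair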
So what is the effect of all this, performance-wise? Again we will compare the original posted version (t12.py in the traces below) running just one block of one thread (not 16 blocks of 64 threads, which would only be worse anyway) against this version, which happens to be running 18 blocks of 64 threads (t11.py):
$ nvprof --print-gpu-trace python t11.py
==3551== NVPROF is profiling process 3551, command: python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
==3551== Profiling application: python t11.py
==3551== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
446.77ms 1.8240us - - - - - 7.8125KB 4.0847GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
446.97ms 1.7600us - - - - - 7.8125KB 4.2333GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.35ms 1.2160us - - - - - 12B 9.4113MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.74ms 1.3440us - - - - - 1.9531KB 1.3859GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.93ms 1.5040us - - - - - 5.8594KB 3.7154GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
448.13ms 1.5360us - - - - - 5.8594KB 3.6380GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
448.57ms 5.4720us (18 1 1) (64 1 1) 36 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update1$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, float, unsigned int) [49]
448.82ms 1.1200us (18 1 1) (64 1 1) 8 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update2$242(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<unsigned char, int=1, A, mutable, aligned>, unsigned int) [50]
448.90ms 2.1120us - - - - - 7.8125KB 3.5277GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ python t12.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
$ nvprof --print-gpu-trace python t12.py
==3604== NVPROF is profiling process 3604, command: python t12.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
==3604== Profiling application: python t12.py
==3604== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
296.22ms 1.8240us - - - - - 7.8125KB 4.0847GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
296.41ms 1.7920us - - - - - 7.8125KB 4.1577GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
296.79ms 1.2160us - - - - - 12B 9.4113MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.21ms 1.3440us - - - - - 1.9531KB 1.3859GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.40ms 1.5040us - - - - - 5.8594KB 3.7154GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.60ms 1.5360us - - - - - 5.8594KB 3.6380GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
298.05ms 1.8453ms (1 1 1) (1 1 1) 36 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<unsigned char, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, float, unsigned int, unsigned int) [38]
299.91ms 2.1120us - - - - - 7.8125KB 3.5277GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that for the original t12.py version, the profiler reports a single update kernel running with 1 block and 1 thread, taking 1.8453 milliseconds. For the modified t11.py version posted in this answer, the profiler reports 18 blocks of 64 threads each for both the update1 and update2 kernels, and the combined execution time of these two kernels is approximately 5.47 + 1.12 = 6.59 microseconds.
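In other words, the kernel work went from roughly 1845.3 microseconds to roughly 6.59 microseconds, a speedup of about 280x, while the memory-copy times are essentially unchanged between the two traces.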
EDIT: Based on some discussion in the comments, it should be possible to combine both kernels into a single kernel, using a double-buffering scheme on p_x and p_y:
$ cat t11.py
import numpy as np
import math
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32)')
def update(p_x, p_y, p_x_new, p_y_new, radii, types, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        p_x_new[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y_new[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500000
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 2
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 80000 // spacing):
    for y in range(1, 6000 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_p_x_new = cuda.to_device(p_x)
device_p_y_new = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for i in range(cycles):
    if i % 2 == 0:
        update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_p_x_new, device_p_y_new, device_radii, device_types, device_velocities, device_max_velocities, acceleration, num_creatures)
    else:
        update[blockspergrid, threadsperblock](device_p_x_new, device_p_y_new, device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities, acceleration, num_creatures)
print(device_p_x_new.copy_to_host())
print(device_p_x.copy_to_host())
$ python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
[ 20.1620903 20.1620903 20.1620903 ..., 0. 0. 0. ]
$
It is still necessary to preserve the kernel-calling loop in cycles in host code, since we still require the global sync provided by the kernel launch. But for a given number of cycles, this will reduce the contribution of the kernel-call overhead. Using this technique, care must be taken in the choice of cycles, as well as in the use of data from either the p_x or the p_x_new buffer, for coherent results.
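To make that last caveat concrete, here is a minimal sketch (my addition, using the variable names from the listing above) of picking the buffer that holds the newest positions after the ping-pong loop:

# iteration i = 0 writes into the *_new buffers, i = 1 writes back into the
# originals, and so on, so where the freshest data lives depends on parity
if cycles % 2 == 1:
    latest_p_x = device_p_x_new.copy_to_host()  # odd cycle count: last write went to the "new" buffers
else:
    latest_p_x = device_p_x.copy_to_host()      # even cycle count: last write went back to the originals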
Source: https://stackoverflow.com/questions/52129778/cuda-parallelize-kernel