Question
I'm trying to parallelize a simple update loop of a simulation on the GPU. Basically there are a bunch of "creatures" represented by circles; in each update loop they move, and then there is a check of whether any of them intersect. radii holds the radius for each of the different types of creatures.
import numpy as np
import math
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32, uint32)')
def update(p_x, p_y, radii, types, velocities, max_velocities, acceleration, num_creatures, cycles):
    for c in range(cycles):
        for i in range(num_creatures):
            velocities[i] = velocities[i] + acceleration
            if velocities[i] > max_velocities[i]:
                velocities[i] = max_velocities[i]
            p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
            p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for i in range(num_creatures):
            for j in range(i, num_creatures):
                delta_x = p_x[j] - p_x[i]
                delta_y = p_y[j] - p_y[i]
                distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
                sum_of_radii = radii[types[i]] + radii[types[i]]
                if distance_squared < sum_of_radii * sum_of_radii:
                    pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = 16
update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities,
                                       acceleration, num_creatures, cycles)
print(device_p_x.copy_to_host())
The 1.0 in math.cos and math.sin is just a placeholder for the directions of the individual creatures.
As it is now, there are multiple threads, but they all execute the same code. What changes do I have to make to the kernel to parallelize it?
Answer 1:
The most obvious dimension for parallelization to me seems to be the loop in i in your kernel, i.e. the one iterating over num_creatures. So I'll describe how to do that.

1. Our goal will be to remove the loop on num_creatures, and instead let each iteration of the loop be handled by a separate CUDA thread. This is possible because the work done in each loop iteration is (mostly) independent - it does not depend on the results of other loop iterations (but see 2 below).

2. A challenge we will run into is that the 2nd for-loop over num_creatures presumably depends on the results of the first. If we leave everything as serial code running in a single thread, then that dependency is taken care of by the nature of serial code execution. However, we want to parallelize this. Therefore we need a global sync between the first for-loop over num_creatures and the 2nd. A simple, convenient global sync in CUDA is the kernel launch, so we'll break the kernel code into two kernel functions. We'll call these update1 and update2.

3. This then presents the challenge of what to do about the over-arching loop in cycles. We can't simply replicate that loop in both kernels, because that would change the functional behavior - we would compute cycles updates to p_x before computing a single calculation of delta_x, for example. That is presumably not what is wanted. So, for simplicity, we'll hoist this loop out of the kernel code and back into host code. The host code will then call the update1 and update2 kernels for cycles iterations.

4. We also want to make the kernel processing adaptable to different sizes of num_creatures. So we'll pick a hard-coded size for threadsperblock, but we will make the number of blocks launched variable, based on the size of num_creatures. To facilitate this, we need a thread-check (an initial if-statement) in each of our kernels, so that "extra" threads don't do anything.
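Concretely, the thread-index plus thread-check idiom from point 4 looks like this in Numba CUDA (a minimal standalone sketch; per_creature_kernel and data are just illustrative names, the real kernels follow below):

from numba import cuda

@cuda.jit
def per_creature_kernel(data, num_creatures):
    i = cuda.grid(1)        # global 1D thread index: blockIdx.x * blockDim.x + threadIdx.x
    if i < num_creatures:   # guard: "extra" threads in the last block do nothing
        data[i] += 1.0      # per-creature work goes here, one creature per thread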
With that description, we end up with something like this:
$ cat t11.py
import numpy as np
import math
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, uint32)')
def update1(p_x, p_y, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])

@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], uint32)')
def update2(p_x, p_y, radii, types, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for i in range(cycles):
    update1[blockspergrid, threadsperblock](device_p_x, device_p_y, device_velocities, device_max_velocities, acceleration, num_creatures)
    update2[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, num_creatures)
print(device_p_x.copy_to_host())
$ python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
$
which produces the same output as the original posted version. (The original code is running 16 blocks of 64 threads all doing exactly the same thing, and stepping on each other as they write to the same data, so in the comparisons that follow I'm referring to the original posted version but running one block of one thread.)
Note that there are certainly other ways to parallelize, and probably other optimizations possible, but this should give you a sensible starting point for CUDA work.
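One such alternative, sketched here as my own illustration (not code from the answer): the pair check in update2 could itself be spread over a 2D grid, one thread per (i, j) pair, instead of each thread looping over j:

from numba import cuda

@cuda.jit
def update2_pairs(p_x, p_y, radii, types, num_creatures):
    i, j = cuda.grid(2)                          # 2D global thread indices
    if i < num_creatures and j < num_creatures and i < j:
        delta_x = p_x[j] - p_x[i]
        delta_y = p_y[j] - p_y[i]
        distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
        # note: this uses types[j] for the second radius; see the remark on radii below
        sum_of_radii = radii[types[i]] + radii[types[j]]
        if distance_squared < sum_of_radii * sum_of_radii:
            pass                                 # collision handling would go here

# launched with a 2D configuration, e.g.:
# blocks2d = ((num_creatures // 16) + 1, (num_creatures // 16) + 1)
# update2_pairs[blocks2d, (16, 16)](device_p_x, device_p_y, device_radii, device_types, num_creatures)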
As mentioned to you in your previous question, the second kernel here really doesn't do anything useful, but I assume that is just a placeholder for future work. I'm also not sure you'll get what you want with your usage of radii here, but that's also not the focus of this question.
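For what it's worth, if the intent of sum_of_radii is the sum of the two circles' radii in the pair, then presumably the second term should index creature j rather than i (my guess at the intent, not what the posted code does):

sum_of_radii = radii[types[i]] + radii[types[j]]  # presumed intent: one radius per creature in the pair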
So what is the effect of all this, performance-wise? Again we will compare the original posted version (t12.py in the traces below) running just one block of one thread (not 16 blocks of 64 threads, which would only be worse anyway) against this version, which happens to be running 18 blocks of 64 threads (t11.py):
$ nvprof --print-gpu-trace python t11.py
==3551== NVPROF is profiling process 3551, command: python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
==3551== Profiling application: python t11.py
==3551== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
446.77ms 1.8240us - - - - - 7.8125KB 4.0847GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
446.97ms 1.7600us - - - - - 7.8125KB 4.2333GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.35ms 1.2160us - - - - - 12B 9.4113MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.74ms 1.3440us - - - - - 1.9531KB 1.3859GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.93ms 1.5040us - - - - - 5.8594KB 3.7154GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
448.13ms 1.5360us - - - - - 5.8594KB 3.6380GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
448.57ms 5.4720us (18 1 1) (64 1 1) 36 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update1$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, float, unsigned int) [49]
448.82ms 1.1200us (18 1 1) (64 1 1) 8 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update2$242(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<unsigned char, int=1, A, mutable, aligned>, unsigned int) [50]
448.90ms 2.1120us - - - - - 7.8125KB 3.5277GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ python t12.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
$ nvprof --print-gpu-trace python t12.py
==3604== NVPROF is profiling process 3604, command: python t12.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
==3604== Profiling application: python t12.py
==3604== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
296.22ms 1.8240us - - - - - 7.8125KB 4.0847GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
296.41ms 1.7920us - - - - - 7.8125KB 4.1577GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
296.79ms 1.2160us - - - - - 12B 9.4113MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.21ms 1.3440us - - - - - 1.9531KB 1.3859GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.40ms 1.5040us - - - - - 5.8594KB 3.7154GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.60ms 1.5360us - - - - - 5.8594KB 3.6380GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
298.05ms 1.8453ms (1 1 1) (1 1 1) 36 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<unsigned char, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, float, unsigned int, unsigned int) [38]
299.91ms 2.1120us - - - - - 7.8125KB 3.5277GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that for the original t12.py version, the profiler reports a single update kernel running with 1 block and 1 thread, taking 1.8453 milliseconds. For the modified t11.py version posted in this answer, the profiler reports 18 blocks of 64 threads each for both the update1 and update2 kernels, and the combined execution time of these two kernels is approximately 5.47 + 1.12 = 6.59 microseconds.
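In other words, the kernel work went from roughly 1845.3 microseconds to roughly 6.59 microseconds, a speedup of about 280x, while the memory-copy times are essentially unchanged between the two traces.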
EDIT: Based on some discussion in the comments, it should be possible to combine both kernels into a single kernel, using a double-buffering scheme on p_x and p_y:
$ cat t11.py
import numpy as np
import math
from numba import cuda

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32)')
def update(p_x, p_y, p_x_new, p_y_new, radii, types, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        p_x_new[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y_new[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500000
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 2
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 80000 // spacing):
    for y in range(1, 6000 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_p_x_new = cuda.to_device(p_x)
device_p_y_new = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for i in range(cycles):
    if i % 2 == 0:
        update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_p_x_new, device_p_y_new, device_radii, device_types, device_velocities, device_max_velocities, acceleration, num_creatures)
    else:
        update[blockspergrid, threadsperblock](device_p_x_new, device_p_y_new, device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities, acceleration, num_creatures)
print(device_p_x_new.copy_to_host())
print(device_p_x.copy_to_host())
$ python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
[ 20.1620903 20.1620903 20.1620903 ..., 0. 0. 0. ]
$
It is still necessary to preserve the kernel-calling loop in cycles in host code, since we still require the global sync provided by the kernel launch. But for a given number of cycles, this will reduce the contribution of the kernel-call overhead. Using this technique, care must be taken in the choice of cycles, as well as in the use of data from either the p_x or the p_x_new buffer, for coherent results.
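To make that last caveat concrete, here is a minimal sketch (my addition, using the variable names from the listing above) of picking the buffer that holds the newest positions after the ping-pong loop:

# iteration i = 0 writes into the *_new buffers, i = 1 writes back into the
# originals, and so on, so where the freshest data lives depends on parity
if cycles % 2 == 1:
    latest_p_x = device_p_x_new.copy_to_host()  # odd cycle count: last write went to the "new" buffers
else:
    latest_p_x = device_p_x.copy_to_host()      # even cycle count: last write went back to the originals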
Source: https://stackoverflow.com/questions/52129778/cuda-parallelize-kernel