CUDA-parallelized raytracer: very low speedup
问题 I'm coding a raytracer using (py)CUDA and I'm obtaining a really low speedup; for example, in a 1000x1000 image, the GPU-parallelized code is just 4 times faster than the sequential code, executed in the CPU. For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to