CUDA-parallelized raytracer: very low speedup

I'm writing a raytracer using (py)CUDA and I'm getting a really low speedup: for example, for a 1000x1000 image, the GPU-parallelized code is only 4 times faster than the sequential code running on the CPU.

For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to generate an image with a width of W pixels and a height of H pixels, the setup is:

Grid: W blocks x H blocks.
Block: 5 threads.

The most expensive computation is solving the equations, which I do with a custom Runge-Kutta 4-5 algorithm.

The code is quite long and hard to explain in such a short question, but you can see it on GitHub. The CUDA kernel is here and the Runge-Kutta solver is here. The CPU version, which uses a sequential version of the exact same solver, can be found in the same repo.

Solving the equations involves a lot of arithmetic, and I suspect that the CPU's optimized implementations of functions such as sin, cos and sqrt may be what keeps the speedup low.

My machine specs are:

GPU: GeForce GTX 780
CPU: Intel Core i7 CPU 930 @ 2.80GHz

My questions are:

Is it normal to get a speedup of 3x or 4x in a GPU-parallelized raytracer against a sequential code?
Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?
Am I missing something important?

I understand the question may be too specific, but if you need more information, just ask and I'll be glad to provide it.

talonmies

Is it normal to get a speedup of 3x or 4x in a GPU-parallelized raytracer against a sequential code?

How long is a piece of string? There is no answer to this question.

Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?

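Yes, as noted in the comments, you are using a completely inappropriate block size, which wastes approximately 85% of the potential computational capacity of your GPU. Threads are scheduled in warps of 32, so a block of only 5 threads leaves 27 of the 32 lanes in its single warp idle.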
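
The 85% figure can be checked with a little arithmetic. A minimal sketch in Python, assuming the usual NVIDIA warp size of 32 threads (the function name `idle_lane_fraction` is made up for illustration and is not part of the asker's code):

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (assumed here)


def idle_lane_fraction(threads_per_block: int) -> float:
    """Fraction of warp lanes left idle by a block of the given size."""
    # A block always occupies a whole number of warps.
    warps = math.ceil(threads_per_block / WARP_SIZE)
    lanes = warps * WARP_SIZE
    return (lanes - threads_per_block) / lanes


# The 5-thread blocks from the question versus a block that fills its warps.
print(idle_lane_fraction(5))    # 0.84375
print(idle_lane_fraction(256))  # 0.0
```

With 5 threads per block, 27 of every 32 lanes do nothing, which is roughly the 85% quoted above. This only counts idle lanes; it says nothing about occupancy limits or memory behaviour, which a profiler would show.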

Am I missing something important?

Yes, the answer to this question. Setting correct execution parameters is about 50% of the practical performance tuning requirements in CUDA, and you should be able to obtain noticeable performance improvements just by selecting a sane block size. Beyond this, careful profiling should be your next port of call.
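
As one illustration of what a saner launch configuration could look like, the sketch below computes a grid for a one-thread-per-ray mapping with 16x16 blocks. Both the mapping and the block size are assumptions made for this example rather than anything taken from the asker's code, and `launch_config` is a hypothetical helper; the best values should come from profiling.

```python
def launch_config(width: int, height: int, block_xy=(16, 16)):
    """Return (grid, block) tuples covering a width x height image.

    Assumes one thread per pixel/ray; the 16x16 block (256 threads,
    i.e. 8 full warps) is an illustrative choice, not a tuned value.
    """
    bx, by = block_xy
    # Ceiling division so the grid covers images whose size is not a
    # multiple of the block size.
    grid = ((width + bx - 1) // bx, (height + by - 1) // by)
    block = (bx, by, 1)
    return grid, block


grid, block = launch_config(1000, 1000)
print(grid, block)  # (63, 63) (16, 16, 1)
```

In PyCUDA, tuples of this shape are what a kernel call takes as its `block=` and `grid=` keyword arguments. Because 63 x 16 = 1008 exceeds 1000, a kernel launched this way would need to return early for threads whose pixel coordinates fall outside the image.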

[This answer assembled from comments and added as community wiki entry to get this (very broad) question off the unanswered list in the absence of enough close votes to close it].

Source: https://stackoverflow.com/questions/39171823/cuda-parallelized-raytracer-very-low-speedup

Tags

performance

cuda

gpgpu