CUDA: warp divergence overhead vs. extra arithmetic


Question


Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs.

But what is the overhead of warp divergence (scheduling only some of the threads to execute certain lines) vs. additional useless arithmetic?

Consider the following dummy example:

version 1:

__device__ int get_D (int A, int B, int C)
{
    //The value A is potentially different for every thread.

    int D = 0;

    if (A < 10)
        D = A*6;
    else if (A < 17)
        D = A*6 + B*2;
    else if (A < 26)
        D = A*6 + B*2 + C; 
    else 
        D = A*6 + B*2 + C*3;

    return D;
}

vs.

version 2:

__device__ int get_D (int A, int B, int C)
{
    //The value A is potentially different for every thread.

    //Each comparison evaluates to 0 or 1, selecting the same terms as the branches in version 1.
    return A*6 + (A >= 10)*(B*2) + (A >= 17)*(A < 26)*C + (A >= 26)*(C*3);
}

My real scenario is more complicated (more conditions) but is the same idea.

Questions:

Is the overhead (in scheduling) of warp divergence so great that version 1 is slower than version 2?

Version 2 requires many more ALU operations than version 1, and most of them are wasted on "multiplication by 0" (only a select few of the conditionals evaluate to 1 rather than 0). Does this tie up valuable ALUs with useless operations, delaying instructions in other warps?


Answer 1:


Concrete answers to questions like these are usually difficult to provide. Many factors influence the comparison between the two cases:

  • You say A is potentially different for each thread, but the extent to which this is true will actually influence the comparison.
  • Overall, whether your code is compute bound or bandwidth bound certainly influences the answer. (If your code is bandwidth bound, there may be no performance difference between the two cases).
  • I know you've identified A, B, and C as integers, but a seemingly innocuous change like making them float could influence the answer significantly.

Fortunately there are profiling tools that can help give crisp, specific answers (or perhaps indicate that there isn't much difference between the two cases). You've done a pretty good job of identifying the two specific cases you care about. Why not benchmark them? And if you want to dig deeper, the profiling tools can give statistics about instruction replay (which comes about due to warp divergence), bandwidth/compute-bound metrics, and so on.
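For example, a minimal benchmarking sketch might look like the following (everything here is my own scaffolding, not code from the original post: the _v1/_v2 names are just the question's two versions renamed, B and C are fixed to arbitrary constants, and the cudaEvent timing is the usual host-side pattern). How you fill A is what actually determines how much divergence version 1 sees:

#include <cstdio>
#include <cuda_runtime.h>

__device__ int get_D_v1 (int A, int B, int C)   // version 1 from the question
{
    int D = 0;
    if      (A < 10) D = A*6;
    else if (A < 17) D = A*6 + B*2;
    else if (A < 26) D = A*6 + B*2 + C;
    else             D = A*6 + B*2 + C*3;
    return D;
}

__device__ int get_D_v2 (int A, int B, int C)   // version 2 from the question
{
    return A*6 + (A >= 10)*(B*2) + (A >= 17)*(A < 26)*C + (A >= 26)*(C*3);
}

__global__ void kernel_v1 (const int *A, int *D, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) D[i] = get_D_v1(A[i], 3, 5);
}

__global__ void kernel_v2 (const int *A, int *D, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) D[i] = get_D_v2(A[i], 3, 5);
}

int main ()
{
    const int n = 1 << 24;
    int *A, *D;
    cudaMalloc(&A, n * sizeof(int));
    cudaMalloc(&D, n * sizeof(int));
    // Fill A here with a distribution that matches your real data;
    // that distribution decides how often the branches in version 1 diverge.

    dim3 block(256), grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    kernel_v1<<<grid, block>>>(A, D, n);            // warm-up launch
    cudaEventRecord(start);
    kernel_v1<<<grid, block>>>(A, D, n);            // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("version 1: %.3f ms\n", ms);

    kernel_v2<<<grid, block>>>(A, D, n);            // warm-up launch
    cudaEventRecord(start);
    kernel_v2<<<grid, block>>>(A, D, n);            // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("version 2: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A);
    cudaFree(D);
    return 0;
}

If the two timings come out essentially equal, that is consistent with the bandwidth-bound case mentioned above, and the whole question becomes moot for this kernel.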

I have to take exception to this blanket statement:

Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs.

That's simply not true. The machine's ability to handle divergent control flow is in fact a feature: it allows us to program the GPU in friendlier languages like C/C++, and it differentiates the GPU from some other acceleration technologies that don't offer the programmer this flexibility.

Like any other optimization effort, you should focus your attention on the heavy lifting first. Does the code you've provided constitute the bulk of the work done by your application? In most cases it doesn't make sense to put this level of analytical effort into something that is basically glue code or not part of the main work of your app.

And if it is the bulk of the work in your code, then the profiling tools are a powerful way to get meaningful answers, and those answers are likely to be more useful than an academic analysis.

Now for my stab at your questions:

Is the overhead (in scheduling) of warp divergence so great that version 1 is slower than version 2?

This will depend on the specific level of branching that actually occurs. In the worst case, with completely independent paths for the 32 threads of a warp, the machine serializes completely and you are in effect running at 1/32 of peak performance. A binary-decision-tree style subdivision of the threads cannot yield this worst case, but it can certainly approach it toward the end of the tree. It might be possible to observe more than a 50% slowdown on this code, possibly 80% or more, due to complete thread divergence at the end, but it will depend statistically on how often the divergence actually occurs (i.e. it is data-dependent). In the worst case, I would expect version 2 to be faster.
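To make the data dependence concrete, here is a contrived sketch (mine, not from the original post; it assumes version 1's get_D is defined in the same file) that feeds every warp A values from all four ranges, so each warp has to step through all four branch bodies with only about a quarter of its lanes active in each:

__global__ void force_all_branches (int *D, int B, int C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int lane = threadIdx.x & 31;   // lane index within the warp
    int A = 5 + (lane % 4) * 8;    // 5, 13, 21, 29: one value in each range of version 1

    // Every warp now holds A values from all four ranges, so the if/else-if
    // chain in get_D executes all four branch bodies in turn.
    D[i] = get_D(A, B, C);
}

If instead all 32 lanes of a warp received A values from the same range (for example because A varies slowly with the thread index), version 1 would execute only one branch body per warp and the divergence penalty would largely disappear.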

Version 2 requires many more ALU operations than version 1, and most of them are wasted on "multiplication by 0" (only a select few of the conditionals evaluate to 1 rather than 0). Does this tie up valuable ALUs with useless operations, delaying instructions in other warps?

float vs. int might actually help here, and might be worth exploring. But the second case appears (to me) to have all the same comparisons as the first case, plus a few extra multiplies. In the float case, the machine can do one multiply per thread per clock, so it's pretty fast. In the int case it's slower, and you can see the specific instruction throughputs for each architecture in the CUDA programming guide. I wouldn't be overly concerned about that level of arithmetic. And again, it may make no difference at all if your app is memory-bandwidth bound.
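As an illustration of the float point, a single-precision variant of version 2 might look like this (my sketch, not from the original post; the comparisons still evaluate to 0 or 1, but the multiplies and adds become single-precision operations):

__device__ float get_D_float (float A, float B, float C)
{
    //Same selection logic as version 2, in single precision.
    return A*6.0f
         + (A >= 10.0f) * (B*2.0f)
         + (A >= 17.0f) * (A < 26.0f) * C
         + (A >= 26.0f) * (C*3.0f);
}

Whether that is acceptable depends on whether D really needs integer semantics in the rest of your kernel.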

Another way to tease all this out would be to write kernels that compare the codes of interest, compile to PTX (nvcc -ptx ...), and compare the PTX instructions. This gives a much better idea of what the machine thread code will look like in each case, and if you do something as simple as an instruction count, you may find not much difference between the two cases (which would favor version 2).
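As a sketch of that workflow (the wrapper names are mine, and it assumes the get_D_v1/get_D_v2 definitions from the benchmark above are in the same file), putting each version behind its own trivial kernel forces both into the generated PTX; compile with something like nvcc -ptx compare.cu -o compare.ptx and diff the two kernel bodies:

__global__ void ptx_probe_v1 (const int *A, const int *B, const int *C, int *D)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    D[i] = get_D_v1(A[i], B[i], C[i]);   // version 1: the if/else-if chain
}

__global__ void ptx_probe_v2 (const int *A, const int *B, const int *C, int *D)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    D[i] = get_D_v2(A[i], B[i], C[i]);   // version 2: the arithmetic form
}

In the PTX, conditional branches appear as predicated bra instructions, so counting the bra lines in each kernel body gives a rough sense of how much real branching survives compilation; it is quite possible the compiler has already flattened the short branches of version 1, in which case the two bodies will look very similar.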



Source: https://stackoverflow.com/questions/16739248/cuda-warp-divergence-overhead-vs-extra-arithmetic
