Determining the optimal value for #pragma unroll N in CUDA


Question


I understand how #pragma unroll works, but if I have the following example:

__global__ void
test_kernel( const float* B, const float* C, float* A_out)
{
  int j = threadIdx.x + blockIdx.x * blockDim.x;
  if (j < array_size) {
     #pragma unroll
     for (int i = 0; i < LIMIT; i++) {
       A_out[i] = B[i] + C[i];
     }
  }
}

I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x threads and y blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1 million seems like a very big number for the variable (1 million unrolled iterations will cause register pressure, and I am not sure the compiler will perform that unroll), what is a "fair" number, if any? And how do I determine that limit?


Answer 1:


Your example kernel is completely serial and not in any way a useful real-world use case for loop unrolling, but let's restrict ourselves to the question of how much loop unrolling the compiler will perform.

Here is a compilable version of your kernel with a bit of template decoration:

template<int LIMIT>
__global__ void
test_kernel( const float* B, const float* C, float* A_out, int array_size)
{
  int j = threadIdx.x + blockIdx.x * blockDim.x;
  if (j < array_size) {
     #pragma unroll
     for (int i = 0; i < LIMIT; i++) {
       A_out[i] = B[i] + C[i];
     }
  }
}

template __global__ void test_kernel<4>(const float*, const float*, float*, int);
template __global__ void test_kernel<64>(const float*, const float*, float*, int);
template __global__ void test_kernel<256>(const float*, const float*, float*, int);
template __global__ void test_kernel<1024>(const float*, const float*, float*, int);
template __global__ void test_kernel<4096>(const float*, const float*, float*, int);
template __global__ void test_kernel<8192>(const float*, const float*, float*, int);

You can compile this to PTX and see for yourself that (at least with the CUDA 7 release compiler and the default compute capability 2.0 target architecture) the kernels with up to LIMIT=4096 are fully unrolled. The LIMIT=8192 case is not unrolled. If you have more patience than I do, you can probably play around with the templating to find the exact compiler limit for this code, although I doubt that is particularly instructive to know.
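For reference, one way to generate the PTX (a minimal sketch: the file name test.cu is assumed, and -arch=sm_20 matches the compute capability 2.0 setup mentioned above):

  nvcc -arch=sm_20 -ptx test.cu -o test.ptx

In the generated test.ptx, a fully unrolled kernel contains no backward branch for the loop, while a rolled loop keeps a branch back to the loop head.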

You can also see for yourself from the compiler output that all of the heavily unrolled versions use the same number of registers (because of the trivial nature of your kernel).
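To check the register counts, you can ask ptxas for per-kernel resource usage (again assuming the file is test.cu; -Xptxas -v is a standard nvcc option):

  nvcc -arch=sm_20 -Xptxas -v -c test.cu

The "Used N registers" lines in the ptxas output give the per-kernel register counts.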




Answer 2:


CUDA takes advantage of thread-level parallelism, which you expose by splitting work into multiple threads, and instruction-level parallelism, which CUDA finds by searching for independent instructions in your compiled code.

@talonmies' result, showing that your loop might be unrolled somewhere between 4096 and 8192 iterations, was surprising to me, because loop unrolling has sharply diminishing returns on a modern CPU, where most iteration overhead has been optimized away with techniques such as branch prediction and speculative execution.

On a CPU, I doubt there would be much to gain from unrolling more than, say, 10-20 iterations, and an unrolled loop takes up more room in the instruction cache, so there is a cost to unrolling as well. The CUDA compiler will be considering the cost/benefit tradeoff when determining how much unrolling to do. So the question is: what might be the benefit of unrolling 4096+ iterations? I think it might be that it gives the GPU more code in which it can search for independent instructions that it can then run concurrently, using instruction-level parallelism.

The body of your loop is A_out[i] = B[i] + C[i];. Since the logic in your loop does not access external variables and does not access results from earlier iterations of the loop, each iteration is independent of all other iterations. So i doesn't have to increase sequentially: the end result would be the same even if the loop iterated over each value of i between 0 and LIMIT - 1 in completely random order. That property makes the loop a good candidate for parallel optimization.
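To make that order-independence concrete, this variant of the loop (just an illustration, not from the original answer) computes exactly the same A_out:

  // Iterating in reverse produces identical results because no iteration
  // reads anything that another iteration writes (assuming A_out does not
  // alias B or C; see the catch below).
  for (int i = LIMIT - 1; i >= 0; i--) {
    A_out[i] = B[i] + C[i];
  }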

But there is a catch, and it's what I mentioned in the comment. The iterations of your loop are only independent if the A buffer is stored separately from your B and C buffers. If your A buffer partially or fully overlaps the B and/or C buffers in memory, a connection between different iterations is created. One iteration may now change the B and C input values for another iteration by writing to A. So you get different results depending on which of the two iterations runs first.

Multiple pointers pointing to the same locations in memory is called pointer aliasing. In general, pointer aliasing can cause "hidden" connections between sections of code that appear to be separate, because writes done by one section of code through one pointer can alter values read by another section of code reading from another pointer. By default, CPU compilers take possible pointer aliasing into account and generate code that yields the correct result regardless. The question is what CUDA does, because, coming back to talonmies' test results, the only reason I can see for such a large amount of unrolling is that it opens the code up for instruction-level parallelism. But that then means that CUDA does not take pointer aliasing into account in this particular situation.
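On the CUDA side, you can make the no-aliasing promise explicit with the __restrict__ qualifier, a standard CUDA C++ feature (applying it to this kernel is my suggestion, not part of talonmies' test):

template<int LIMIT>
__global__ void
test_kernel( const float* __restrict__ B, const float* __restrict__ C,
             float* __restrict__ A_out, int array_size)
{
  // __restrict__ promises the compiler that A_out never aliases B or C,
  // so writes through A_out cannot change values later read through B or C,
  // and the iterations really are independent.
  int j = threadIdx.x + blockIdx.x * blockDim.x;
  if (j < array_size) {
     #pragma unroll
     for (int i = 0; i < LIMIT; i++) {
       A_out[i] = B[i] + C[i];
     }
  }
}

Note that if the caller ever does pass overlapping buffers, the behavior becomes undefined, so the qualifier is only safe when the no-overlap guarantee actually holds.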

Re: your question about running more than a single thread: a regular serial program does not automatically become a parallel program when you increase the number of threads. You have to identify the portions of the work that can run in parallel and then express that in your CUDA kernel. That's what's called thread-level parallelism, and it's the main source of performance increase for your code. In addition, CUDA will search for independent instructions in each kernel and may run those concurrently, which is instruction-level parallelism. Advanced CUDA programmers may keep instruction-level parallelism in mind and write code that facilitates it, but we mortals should just focus on thread-level parallelism. That means you should look at your code again and consider what might be able to run in parallel. Since we already concluded that the body of your loop is a good candidate for parallelization, your job becomes rewriting the serial loop in your kernel to express to CUDA how to run separate iterations in parallel, as sketched below.
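Here is a minimal sketch of such a rewrite (the grid-stride loop is a common CUDA idiom; the kernel name and launch configuration are illustrative, not from the question):

__global__ void
vector_add(const float* B, const float* C, float* A_out, int array_size)
{
  // Each thread handles elements i = tid, tid + stride, tid + 2*stride, ...
  // Because the iterations are independent, any assignment of i values to
  // threads produces the same result.
  int stride = blockDim.x * gridDim.x;
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < array_size; i += stride) {
    A_out[i] = B[i] + C[i];
  }
}

// Launch with enough threads to cover the array, for example:
//   int threads = 256;
//   int blocks  = (array_size + threads - 1) / threads;
//   vector_add<<<blocks, threads>>>(d_B, d_C, d_A, array_size);

With this structure the thread count, rather than an unrolled LIMIT, expresses the parallelism, and the compiler remains free to unroll the grid-stride loop a few times on its own.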



Source: https://stackoverflow.com/questions/34524991/determining-the-optimal-value-for-pragma-unroll-n-in-cuda
