CUDA - why is warp based parallel reduction slower?


I think the reason your code is slower than mine is that in my code, half as many warps are active for each ADD in the first phase. In your code, all warps are active for all of the first phase. So overall your code executes more warp instructions. In CUDA it's important to consider total "warp instructions" executed, not just the number of instructions executed by one warp.

Also, there's no point in only using half of your warps. There is overhead in launching the warps only to have them evaluate two branches and exit.

Another thought is that the use of unsigned char and short might actually be costing you performance. I'm not sure, but it's certainly not saving you registers, since they are not packed into single 32-bit registers.

Also, in my original code, I replaced blockDim.x with a template parameter, BLOCKDIM, which means that it only used 5 run-time if statements (the ifs in the second stage are eliminated by the compiler).
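To illustrate the idea, here is a generic sketch (not the code I'm referring to; blockSumKernel and the buffer names are just illustrative, and it assumes a power-of-two block size between 64 and 512 with one input element per thread). Every test against BLOCKDIM is a compile-time constant, so the compiler drops the branches that can't apply and only the thread-index checks remain at run time:

// Sketch only: block size as a template parameter, so the BLOCKDIM tests
// below cost nothing at run time.
template <typename T, unsigned int BLOCKDIM>
__global__ void blockSumKernel(const T *in, T *out)
{
  __shared__ T sdata[BLOCKDIM];
  const unsigned int tid = threadIdx.x;
  sdata[tid] = in[blockIdx.x * BLOCKDIM + tid];
  __syncthreads();

  if (BLOCKDIM >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
  if (BLOCKDIM >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
  if (BLOCKDIM >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

  // Last 64 values: warp-synchronous, same assumption as the code below.
  if (tid < 32) {
    volatile T *v = sdata;
    v[tid] += v[tid + 32];
    v[tid] += v[tid + 16];
    v[tid] += v[tid +  8];
    v[tid] += v[tid +  4];
    v[tid] += v[tid +  2];
    v[tid] += v[tid +  1];
  }
  if (tid == 0) out[blockIdx.x] = sdata[0];
}

// e.g. blockSumKernel<float, 256><<<numBlocks, 256>>>(d_in, d_partialSums);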

BTW, a cheaper way to compute your threadWarpId (a single bitwise AND) is

const int threadWarpId = threadIdx.x & 31;

You might check this article for more ideas.

EDIT: Here's an alternative warp-based block reduction.

template <typename T, int level>
__device__
void sumReduceWarp(volatile T *sdata, const unsigned int tid)
{
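  // Warp-synchronous tree sum over 2^level consecutive values: afterwards the
  // thread at the base of the range (lane 0) holds the total. The volatile
  // pointer plus lockstep execution within a warp stands in for __syncthreads().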
  T t = sdata[tid];
  if (level > 5) sdata[tid] = t = t + sdata[tid + 32];
  if (level > 4) sdata[tid] = t = t + sdata[tid + 16];
  if (level > 3) sdata[tid] = t = t + sdata[tid +  8];
  if (level > 2) sdata[tid] = t = t + sdata[tid +  4];
  if (level > 1) sdata[tid] = t = t + sdata[tid +  2];
  if (level > 0) sdata[tid] = t = t + sdata[tid +  1];
}

template <typename T>
__device__
void sumReduceBlock(T *output, volatile T *sdata)
{
  // sdata is a shared array of length 2 * blockDim.x

  const unsigned int warp = threadIdx.x >> 5;
  const unsigned int lane = threadIdx.x & 31;
  const unsigned int tid  = (warp << 6) + lane;

  sumReduceWarp<T, 6>(sdata, tid);
  __syncthreads();

  // lane 0 of each warp now contains the sum of two warps' values
  if (lane == 0) sdata[warp] = sdata[tid];

  __syncthreads();

  if (warp == 0) {
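    // The first warp reduces the per-warp sums; level 4 covers 16 values, so
    // as written this assumes 16 warps, i.e. a 512-thread block.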
    sumReduceWarp<T, 4>(sdata, threadIdx.x);
    if (lane == 0) *output = sdata[0];
  }
}

This should be a bit faster because it uses all of the warps launched in the first stage and has no branching in the last stage, at the cost of an extra branch, a shared-memory load/store, and a __syncthreads() in the new middle stage. I haven't tested this code; if you run it, let me know how it performs. If you use a template for the block size in your original code it may again be faster, but I think this code is more succinct.
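For reference, here is an untested sketch of how a kernel might feed sumReduceBlock (reduceKernel, d_in and d_blockSums are illustrative names). It assumes 512-thread blocks, so the level-4 reduction over the warp sums covers exactly 16 warps, and a dynamically allocated shared array of 2 * blockDim.x elements:

// Sketch of a possible caller: each block reduces 2 * blockDim.x consecutive
// input elements and writes one partial sum.
template <typename T>
__global__ void reduceKernel(const T *d_in, T *d_blockSums)
{
  extern __shared__ unsigned char smemRaw[];
  T *sdata = reinterpret_cast<T *>(smemRaw);   // 2 * blockDim.x elements of T

  // Two loads per thread fill the whole shared array.
  const unsigned int base = 2 * blockIdx.x * blockDim.x;
  sdata[threadIdx.x]              = d_in[base + threadIdx.x];
  sdata[threadIdx.x + blockDim.x] = d_in[base + threadIdx.x + blockDim.x];
  __syncthreads();   // each warp reads values written by other warps

  sumReduceBlock<T>(&d_blockSums[blockIdx.x], sdata);
}

// Launch with 512-thread blocks and dynamic shared memory for 2 * 512 values:
//   reduceKernel<float><<<numBlocks, 512, 2 * 512 * sizeof(float)>>>(d_in, d_blockSums);

The per-block results in d_blockSums still need a final pass (another launch, or a sum on the host).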

Note that the temporary variable t is used because Fermi and later architectures use a pure load/store architecture, so a += from shared memory to shared memory results in an extra load (since the sdata pointer must be volatile). Explicitly loading into the temporary once avoids this. On G80 it won't make a difference to performance.
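In other words, the difference is roughly between these two forms (the helper functions are purely for illustration):

// Re-reads sdata[tid] on every step: two loads, an add, and a store.
template <typename T>
__device__ void sumStepReload(volatile T *sdata, unsigned int tid)
{
  sdata[tid] += sdata[tid + 16];
}

// Keeps the running sum in the register t: only the partner value is loaded,
// and the store still makes the result visible to the other lanes.
template <typename T>
__device__ T sumStepCached(volatile T *sdata, unsigned int tid, T t)
{
  sdata[tid] = t = t + sdata[tid + 16];
  return t;
}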

You should also check the examples in the SDK. I remember one very nice example that implements several different reduction strategies; at least one of them also uses warp-based reduction.

(I can't look up the name right now, because I only have it installed on my other machine.)
