Question
I started with CUDA and wrote two kernels as an experiment. They both accept three pointers to arrays of n*n floats (emulating a matrix) and n.
__global__
void th_single_row_add(float* a, float* b, float* c, int n) {
    // each thread walks one row of n consecutive elements
    int idx = blockDim.x * blockIdx.x * n + threadIdx.x * n;
    for (int i = 0; i < n; i++) {
        if (idx + i >= n * n) return;
        c[idx + i] = a[idx + i] + b[idx + i];
    }
}

__global__
void th_single_col_add(float* a, float* b, float* c, int n) {
    // each thread walks one column, striding by n between elements
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < n; i++) {
        int idx2 = idx + i * n;
        if (idx2 >= n * n) return;
        c[idx2] = a[idx2] + b[idx2];
    }
}
In th_single_row_add each thread sums one row of n elements; in th_single_col_add each thread sums one column.
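For context, here is a minimal host-side sketch of how the kernels could be launched; the launch configuration (one thread per row or column, 256 threads per block) is my assumption, since it is not shown in the question.

#include <cuda_runtime.h>

// The two kernels above are assumed to be defined in the same file.
int main() {
    const int n = 1000;
    const size_t bytes = size_t(n) * n * sizeof(float);

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);   // placeholder data; real inputs would be copied in
    cudaMemset(b, 0, bytes);

    // One thread per row (or per column): n threads in total.
    const int block = 256;
    const int grid = (n + block - 1) / block;

    th_single_row_add<<<grid, block>>>(a, b, c, n);
    th_single_col_add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}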
Here is the profile for n = 1000 (1 000 000 elements):
986.29us th_single_row_add(float*, float*, float*, int)
372.96us th_single_col_add(float*, float*, float*, int)
As you can see, the column sum is about three times faster.
I thought that because in the column variant all the indexes in the loop are far from each other, it should be slower. Where am I wrong?
Answer 1:
Threads in CUDA don't act individually, they are grouped together in warps of 32 threads. Those 32 threads execute in lockstep (usually). An instruction issued to one thread is issued to all 32 at the same time, in the same clock cycle.
If that instruction is an instruction that reads memory (for example), then up to 32 independent reads may be required/requested. The exact patterns of addresses needed to satisfy these read operations is determined by the code you write. If those addresses are all "adjacent" in memory, that will be an efficient read. If those addresses are somehow "scattered" in memory, that will be an inefficient read, and will be slower.
This basic concept just described is called "coalesced" access in CUDA. Your column-summing case allows for coalesced access across a warp, because the addresses generated by each thread in the warp are in adjacent columns, and the locations are adjacent in memory. Your row summing case breaks this. The addresses generated by each thread in the warp are not adjacent (they are "columnar", separated from each other by the width of your array) and are therefore not "coalesced".
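To make this concrete, here is a small illustration (mine, not part of the original answer) of which elements the 32 lanes of the first warp touch on the first loop iteration of each kernel, for n = 1000:

#include <cstdio>

int main() {
    const int n = 1000;
    // First warp: blockIdx.x == 0, lanes threadIdx.x = 0..31, loop iteration i == 0.
    for (int lane = 0; lane < 32; ++lane) {
        int row_elem = lane * n;   // th_single_row_add: (blockDim.x*0*n + lane*n) + 0
        int col_elem = lane;       // th_single_col_add: (blockDim.x*0 + lane) + 0*n
        printf("lane %2d: row kernel -> element %6d, col kernel -> element %2d\n",
               lane, row_elem, col_elem);
    }
    return 0;
}

In the column kernel the warp reads 32 consecutive floats (128 contiguous bytes, which the hardware can service as a single coalesced transaction), while in the row kernel the 32 addresses are 4000 bytes apart, so each lane's access lands in a different memory segment.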
The difference in performance is due to this difference in memory access efficiency.
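If the goal is just an elementwise sum, a common way to get coalesced access is a grid-stride loop in which consecutive threads handle consecutive elements; this kernel is my own sketch, not part of the original answer:

__global__
void elementwise_add(const float* a, const float* b, float* c, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n * n; i += stride) {
        // adjacent lanes touch adjacent addresses on every iteration
        c[i] = a[i] + b[i];
    }
}

Launched with, say, <<<(n*n + 255)/256, 256>>>, each warp reads and writes contiguous 128-byte chunks, just like the column variant.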
You can read more about coalescing behavior in CUDA in an introductory treatment of CUDA optimization, such as here, especially slides 44-54.
Source: https://stackoverflow.com/questions/58780710/dont-understand-why-column-addition-faster-than-row-in-cuda