Cumulative sum in two dimensions on array in nested loop — CUDA implementation?


Question


I have been thinking about how to perform this operation in CUDA using reductions, but I'm a bit at a loss as to how to accomplish it. The C code is below. The important part to keep in mind is that the variable precalculatedValue depends on both loop iterators. Also, the variable ngo is not unique to every value of m... e.g. m = 0,1,2 might have ngo = 1, whereas m = 4,5,6,7,8 could have ngo = 2, etc. I have included the sizes of the loop iterators in case it helps to provide better implementation suggestions.

// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

int Nobs = 60480;
int NgS  = 1859;
int NgO  = 900;
// ngo goes from [1,900]

// rInd is an initialized (and filled earlier) as:
// rInd = new long int [Nobs];

for (m=0; m<Nobs; m++) {        
    ngo=rInd[m]-1;

    for (n=0; n<NgS; n++) {
        Aggregation[idx(n,ngo,NgO)] += precalculatedValue;
    }
}

In a previous case, when precalculatedValue was only a function of the inner loop variable, I saved the values in unique array indices and added them with a parallel reduction (Thrust) after the fact. However, this case has me stumped: the values of m are not uniquely mapped to the values of ngo, so I don't see a way of making this code efficient (or even workable) with a reduction. Any ideas are most welcome.


Answer 1:


Just a stab...

I suspect that transposing your loops might help.

for (n=0; n<NgS; n++) {
    for (m=0; m<Nobs; m++) {            
        ngo=rInd[m]-1;
        Aggregation[idx(n,ngo,NgO)] += precalculatedValue(m,n);
    }
}

I did this because idx varies more rapidly with ngo (a function of m) than with n, so making m the inner loop improves coherence. Note I also made precalculatedValue a function of (m, n), because you said that it is; this makes the pseudocode clearer.

Then you could start by leaving the outer loop on the host and writing a kernel for the inner loop (60,480-way parallelism is enough to fill most current GPUs).

In the inner loop, just start by using atomicAdd() to handle collisions. If they are infrequent, performance shouldn't be too bad on Fermi GPUs. In any case, you will be bandwidth bound, since the arithmetic intensity of this computation is low. So once this is working, measure the bandwidth you are achieving and compare it to the peak for your GPU. If you are way off, then think about optimizing further (perhaps by parallelizing the outer loop -- one iteration per thread block -- and doing the inner loop using shared memory and thread cooperation optimizations).
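
A minimal sketch of that starting point (the kernel name, the precalc array, and the launch configuration are mine, not from the question; precalc is assumed to hold precalculatedValue(m, n) for the current n, and float is used because double-precision atomicAdd is not natively available on Fermi):

#include <cuda_runtime.h>

// One launch per outer-loop iteration n; each thread handles one m.
__global__ void accumulateRow(const long int *rInd,
                              const float    *precalc,      // precalculatedValue(m, n) for this n
                              float          *Aggregation,
                              int n, int Nobs, int NgO)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < Nobs) {
        int ngo = (int)rInd[m] - 1;
        // Different m can map to the same ngo, so resolve collisions with atomicAdd.
        atomicAdd(&Aggregation[ngo + n * NgO], precalc[m]);  // same index as idx(n, ngo, NgO)
    }
}

// Host side: keep the outer loop on the CPU for now.
void launchAllRows(const long int *d_rInd, const float *d_precalc,
                   float *d_Aggregation, int NgS, int Nobs, int NgO)
{
    const int threads = 256;
    const int blocks  = (Nobs + threads - 1) / threads;
    for (int n = 0; n < NgS; n++) {
        // d_precalc would have to be (re)filled with precalculatedValue(m, n)
        // for this n; how that is done is outside the scope of this sketch.
        accumulateRow<<<blocks, threads>>>(d_rInd, d_precalc, d_Aggregation,
                                           n, Nobs, NgO);
    }
    cudaDeviceSynchronize();
}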

The key: start simple, measure performance, and then decide how to optimize.

Note that this calculation looks very similar to a histogram calculation, which has similar challenges, so you might want to google for GPU histograms to see how they have been implemented.

One idea is to sort the (rInd[m], m) pairs using thrust::sort_by_key(); then, since the rInd duplicates will be grouped together, you can iterate over them and do your reductions without collisions. (This is one way to do histograms.) You could even do the per-group sums with thrust::reduce_by_key().
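
A rough sketch of that idea for a single value of n, with illustrative names (keys holding ngo = rInd[m] - 1 and vals holding the corresponding precalculatedValue(m, n)); scattering the per-key sums back into row n of Aggregation is left out:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sorting by key groups equal ngo values together, so reduce_by_key can then
// sum each group without any atomics. Returns the number of distinct ngo values;
// outKeys/outVals receive the distinct keys and their partial sums.
int reduceOneRow(thrust::device_vector<int>   &keys,     // length Nobs
                 thrust::device_vector<float> &vals,     // length Nobs
                 thrust::device_vector<int>   &outKeys,  // length >= Nobs
                 thrust::device_vector<float> &outVals)  // length >= Nobs
{
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    auto new_end = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                         outKeys.begin(), outVals.begin());

    // The sums in outVals can now be added into row n of Aggregation
    // (e.g. with a small scatter kernel or thrust::scatter).
    return static_cast<int>(new_end.first - outKeys.begin());
}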



Source: https://stackoverflow.com/questions/9527026/cumulative-sum-in-two-dimensions-on-array-in-nested-loop-cuda-implementation
