Question
My question concerns coalesced global writes to a dynamically changing set of elements of an array in CUDA. Consider the following kernel:
__global__ void
kernel (int n, int *odata, int *idata, int *hash)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        odata[hash[i]] = idata[i];
}
Here the first n elements of the array hash contain the indices of odata to be updated from the first n elements of idata. Obviously this leads to a terrible, terrible lack of coalescence. In my code, the hash at one kernel invocation is completely unrelated to the hash at another (and other kernels update the data in other ways), so simply reordering the data to optimize this particular kernel isn't an option.
Is there some feature in CUDA which would allow me to improve the performance of this situation? I hear a lot of talk about texture memory, but I've not been able to translate what I've read into a solution for this problem.
Answer 1:
Texturing is a read-only mechanism, so it cannot directly improve the performance of scattered writes to GMEM. If you were "hashing" like this instead:
odata[i] = idata[hash[i]];
(perhaps your algorithm can be transformed?)
Then there might be some benefit to considering a Texture mechanism. (Your example appears to be 1D in nature).
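If the algorithm really can be inverted into a gather like that, a minimal sketch might look like the following. The use of __ldg (which routes reads through the read-only/texture data cache on compute capability 3.5+ devices) and the const __restrict__ qualifiers are my own assumptions, not part of the original answer:
// Hypothetical gather formulation: each thread reads through hash (scattered,
// but served by the read-only data cache via __ldg) and writes odata
// contiguously, so the stores are fully coalesced.
__global__ void
gather_kernel (int n, int *odata, const int * __restrict__ idata,
               const int * __restrict__ hash)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        odata[i] = __ldg(&idata[__ldg(&hash[i])]);
}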
You might also make sure that your shared memory/L1 split is optimized towards cache. This won't help much with scattered writes though.
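For reference, on Fermi/Kepler-class devices the shared memory/L1 split can be requested from the host with the standard runtime calls; shown here only as a reminder of where the knob lives:
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);        // device-wide preference for larger L1
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);  // or set it per kernel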
Answer 2:
Can you limit the range of the hash values? For example, you may know that the first 1K threads only access odata in the range 0 to 8K.
If so, you can use shared memory. Allocate a block of shared memory, do the fast scattered writes into shared memory first, and then write the shared-memory block back to global memory in coalesced transactions.
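A rough sketch of this staging idea, under the purely illustrative assumptions that the hash values produced by block b all fall inside a fixed window of TILE elements starting at b * TILE, that these windows do not overlap between blocks, and that odata holds at least gridDim.x * TILE elements:
#define TILE 1024

__global__ void
staged_scatter (int n, int *odata, const int *idata, const int *hash)
{
    __shared__ int tile[TILE];
    int base = blockIdx.x * TILE;               // this block's output window

    // Pre-load the window so slots that no thread writes keep their old value.
    for (int j = threadIdx.x; j < TILE; j += blockDim.x)
        tile[j] = odata[base + j];
    __syncthreads();

    // Fast scattered writes, but only within shared memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[hash[i] - base] = idata[i];
    __syncthreads();

    // Coalesced write-back of the whole window to global memory.
    for (int j = threadIdx.x; j < TILE; j += blockDim.x)
        odata[base + j] = tile[j];
}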
Source: https://stackoverflow.com/questions/12938333/coalesced-global-memory-writes-using-hash