Question
This question lacked details, so I decided to create another question instead of editing this one. The new question is here: "Can i parallelize my code or it is not worth?"
I have a program running in CUDA, where one piece of the code runs within a loop (serialized, as you can see below). This piece of code is a search within an array that contains addresses and/or NULL pointers. All the threads execute the code below.
while (i < n) {
    if (array[i] != NULL) {
        return array[i];   // first non-NULL address wins
    }
    i++;
}
return NULL;               // no non-NULL entry found
Here n is the size of the array, and the array is in shared memory. I'm only interested in the first address that is different from NULL (the first match).
The whole code (I've posted only a piece; the whole code is big) runs fast, but the "heart" of it (i.e., the part that is repeated the most) is serialized, as you can see. I want to know whether I can parallelize this part (the search) with some optimized algorithm.
As I said, the program is already in CUDA (and the array is in device memory), so there will be no memory transfers between host and device.
My problem is that n is not big. It will rarely be greater than 8.
I've tried to parallelize it, but my "new" code took more time than the code above.
I was studying reduction and min operations, but I found that they are mainly useful when n is big.
So, any tips? Can I parallelize it efficiently, i.e., with low overhead?
Answer 1:
Keeping things simple, one of the major limiting factors of GPGPU code is memory management. In most computers, copying memory to the device (GPU) is a slow process.
As illustrated by http://www.ncsa.illinois.edu/~kindr/papers/ppac09_paper.pdf:
"The key requirement for obtaining effective acceleration from GPU subroutine libraries is minimization of I/O between the host and the GPU."
This is because I/O operations between host and device are SLOW!
Tying this back to your problem: it doesn't really make sense to run this search on the GPU, since the amount of data you mention is so small. You would spend more time running the memcpy routines than the whole search would take on the CPU in the first place, especially since you mention you are only interested in the first match.
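To put a number on that claim, here is a minimal sketch that times a single host-to-device copy with CUDA events. The array size and variable names are illustrative, not from the question (which, to be fair, states the array already lives on the device):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t n = 8;                 // illustrative size; the question's n rarely exceeds 8
    void *h_array[n] = { NULL };        // tiny host array of pointers
    void **d_array;
    cudaMalloc(&d_array, n * sizeof(void *));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a single host-to-device copy of the tiny array.
    cudaEventRecord(start);
    cudaMemcpy(d_array, h_array, n * sizeof(void *), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("H2D copy of %zu pointers took %f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_array);
    return 0;
}
Even for eight pointers, the fixed per-call overhead of cudaMemcpy (typically on the order of microseconds) dwarfs the cost of scanning eight elements on the CPU.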
One common misconception many people have is that "if I run it on the GPU, it has more cores, so it will run faster", and this just isn't the case.
When deciding whether it is worth porting to CUDA or OpenCL, you must think about whether the process is inherently parallel: are you processing very large amounts of data, etc.?
Answer 2:
Since you say the array is a shared memory resource, the result of this search is the same for every thread of a block. This means a first and simple optimization would be to let only a single thread do the search. This frees all but the first warp of the block from doing any work (they still need to wait for the result, but don't have to waste any computing resources):
__shared__ void *result;    // note: a __shared__ variable cannot have an initializer
if(tid == 0)
{
    result = NULL;
    for(unsigned int i = 0; i < n; ++i)
    {
        if(array[i] != NULL)
        {
            result = array[i];  // keep the first match and stop searching
            break;
        }
    }
}
__syncthreads();            // make the result visible to all threads of the block
return result;
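For context, here is a minimal self-contained sketch of how this fragment could sit inside a kernel, assuming the pointers are first staged from global into shared memory; the names find_first, search_kernel, and N_MAX are illustrative, not from the original answer:
#include <cstddef>

#define N_MAX 8   // the question states n is rarely greater than 8

// Single-thread search, wrapped as a __device__ helper that all threads
// of the block call; every thread returns the same pointer.
__device__ void *find_first(void **array, unsigned int n)
{
    __shared__ void *result;
    unsigned int tid = threadIdx.x;

    if (tid == 0)
    {
        result = NULL;
        for (unsigned int i = 0; i < n; ++i)
        {
            if (array[i] != NULL)
            {
                result = array[i];
                break;
            }
        }
    }
    __syncthreads();
    return result;
}

__global__ void search_kernel(void **global_ptrs, unsigned int n)
{
    __shared__ void *array[N_MAX];
    unsigned int tid = threadIdx.x;

    if (tid < n)
        array[tid] = global_ptrs[tid];   // stage the pointers into shared memory
    __syncthreads();

    void *first = find_first(array, n);
    (void)first;   // ... the rest of the kernel would use `first` here ...
}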
A step further would then be to let the threads perform the search in parallel, as a classic intra-block reduction. If you can guarantee that n is always <= 64, you can do this within a single warp and don't need any synchronization during the search (except for the complete synchronization at the end, of course). Note that this warp-synchronous style relies on lockstep execution within a warp; on GPUs with independent thread scheduling (Volta and later) it would additionally need explicit __syncwarp() calls between the strided steps.
// Reduce the upper half into the lower half until 64 elements remain.
for(unsigned int i = n/2; i > 32; i >>= 1)
{
    if(tid < i && !array[tid])
        array[tid] = array[tid + i];
    __syncthreads();
}
// Final strides within a single warp. The volatile alias keeps the compiler
// from caching shared-memory reads in registers between the steps.
void * volatile *v = array;
if(tid < 32)
{
    if(n > 32 && !v[tid]) v[tid] = v[tid + 32];
    if(n > 16 && !v[tid]) v[tid] = v[tid + 16];
    if(n >  8 && !v[tid]) v[tid] = v[tid +  8];
    if(n >  4 && !v[tid]) v[tid] = v[tid +  4];
    if(n >  2 && !v[tid]) v[tid] = v[tid +  2];
    if(n >  1 && !v[tid]) v[tid] = v[tid +  1];
}
__syncthreads();
return array[0];
Of course, the example assumes n to be a power of two (and the array to be padded with NULLs accordingly). Also note that this reduction returns some non-NULL element, not necessarily the lowest-indexed one: with [NULL, A, B, NULL], for example, the stride-2 step copies B into slot 0 before slot 1's A can reach it. If you strictly need the first match, reduce over indices instead (take the minimum index holding a non-NULL pointer). Feel free to tune this to your needs and optimize it further.
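As a hedged alternative sketch for the small n in the question: on CUDA 9 and later, a warp vote can find the lowest-indexed non-NULL slot directly, which also preserves the first-match order that the plain reduction above does not guarantee. The function name find_first_warp and the assumption that all 32 threads of a full warp reach the call are mine, not from the original answer.
// Sketch: first-match search via a warp vote (assumes n <= 32 and that all
// 32 threads of the warp execute this call; CUDA 9+ for __ballot_sync).
__device__ void *find_first_warp(void **array, unsigned int n)
{
    unsigned int lane = threadIdx.x & 31;
    // Each lane reports whether its slot holds a non-NULL pointer.
    int has_ptr = (lane < n) && (array[lane] != NULL);
    unsigned int mask = __ballot_sync(0xFFFFFFFFu, has_ptr);
    if (mask == 0)
        return NULL;                  // no non-NULL entry at all
    // __ffs gives the 1-based position of the lowest set bit, i.e. the
    // lowest-indexed lane that found a pointer: exactly the first match.
    return array[__ffs(mask) - 1];
}
For n <= 8, as in the question, this replaces the whole strided reduction with a single vote plus one bit scan, with no shared-memory traffic during the search.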
Source: https://stackoverflow.com/questions/17937438/can-i-parallelize-my-code-or-it-is-not-worth