Strategy for doing final reduction
Question

I am trying to implement an OpenCL version of a reduction over an array of floats. To do so, I took the following code snippet found on the web:

__kernel void sumGPU(__global const double *input,
                     __global double *partialSums,
                     __local double *localSums)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);

    // Copy from global memory to local memory
    localSums[local_id] = input[get_global_id(0)];

    // Loop for computing localSums
    for (uint stride = group_size/2; stride>0