Sum a variable over all threads in a CUDA Kernel and return it to Host


Question


I'm new to CUDA and I'm trying to implement a kernel to calculate the energy in my Metropolis Monte Carlo simulation.

Here is the serial version of the function:

float calc_energy(struct frame frm, float L, float rc){
    int i, j;
    float E = 0, rij, dx, dy, dz;

    for (i = 0; i < frm.natm; i++)
    {
        for (j = i + 1; j < frm.natm; j++)
        {
            /* minimum-image distances */
            dx = fabs(frm.conf[j][0] - frm.conf[i][0]);
            dy = fabs(frm.conf[j][1] - frm.conf[i][1]);
            dz = fabs(frm.conf[j][2] - frm.conf[i][2]);
            dx = dx - round(dx/L)*L;
            dy = dy - round(dy/L)*L;
            dz = dz - round(dz/L)*L;

            /* pair distance */
            rij = sqrt(dx*dx + dy*dy + dz*dz);

            /* Lennard-Jones pair energy inside the cutoff */
            if (rij <= rc)
            {
                E = E + (4*((1/pow(rij,12)) - (1/pow(rij,6))));
            }
        }
    }

    return E;
}

Then I tried to parallelize this using CUDA. This is my idea:

__global__ void calc_energy(frame* s, float L, float rc)
{
    extern __shared__ float E;

    /* one (i, j) pair per thread */
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    int j = blockDim.y*blockIdx.y + threadIdx.y;

    float rij, dx, dy, dz;

    /* minimum-image distances, as in the serial version */
    dx = fabs(s->conf[j][0] - s->conf[i][0]);
    dy = fabs(s->conf[j][1] - s->conf[i][1]);
    dz = fabs(s->conf[j][2] - s->conf[i][2]);
    dx = dx - round(dx/L)*L;
    dy = dy - round(dy/L)*L;
    dz = dz - round(dz/L)*L;

    rij = sqrt(dx*dx + dy*dy + dz*dz);

    if (rij <= rc)
    {
        E += (4*((1/pow(rij,12)) - (1/pow(rij,6)))); // <- here is the big problem
    }
}

My main question is: how do I sum the variable E from each thread and return it to the host? I intend to use as many threads and blocks as possible.

Obviously part of the code is missing where the variable E is accumulated.

I have read a few things about reduction methods, but I would like to know if this is necessary here.

I call the kernel using the following code:

 calc_energy<<<dimGrid,dimBlock>>>(d_state, 100, 5);
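
dimGrid and dimBlock are set up earlier in my code; as a sketch, a 2D configuration that covers all natm x natm index pairs could look like this (the 16x16 block size is just an arbitrary choice):

dim3 dimBlock(16, 16);                                  // 256 threads per block (arbitrary)
dim3 dimGrid((frm.natm + dimBlock.x - 1) / dimBlock.x,  // enough blocks to cover every i
             (frm.natm + dimBlock.y - 1) / dimBlock.y); // and every j
calc_energy<<<dimGrid,dimBlock>>>(d_state, 100, 5);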

Edit:

I understood that I needed to use a reduction method. CUB works great for me.

Continuing with the implementation of the code, I realized that I have a new problem, perhaps because of my lack of knowledge in this area.

In my nested loop, the variable frm.natm can reach values on the order of 10^5. Thinking of my GPU (GTX 750 Ti), the number of threads per block is 1024 and the number of blocks per grid is 1024. If I understood correctly, the maximum number of threads in a kernel launch is 1024 x 1024 = 1,048,576 (actually less than that).

So if I need to do 10^5 x 10^5 = 10^10 calculations in my nested loop, what would be the best way to structure the algorithm? Would it be a good idea to choose a fixed number of threads (one that fits my GPU) and split the calculations among them?
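
For example, a grid-stride loop over a flattened pair index would let a fixed number of threads walk over all the pairs; a rough sketch of what I have in mind (here the coordinates are assumed to be a flat float3 array instead of my frame struct, and calc_energy_strided / E_total are placeholder names):

__global__ void calc_energy_strided(const float3 *conf, int natm,
                                    float L, float rc, float *E_total)
{
    long long npairs = (long long)natm * natm;   // flattened (i, j) index space
    float E = 0.0f;

    // Grid-stride loop: each thread handles pair indices k, k+stride, k+2*stride, ...
    long long stride = (long long)blockDim.x * gridDim.x;
    for (long long k = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         k < npairs; k += stride)
    {
        int i = (int)(k / natm);
        int j = (int)(k % natm);
        if (j <= i) continue;                    // keep only unique pairs with i < j

        // Minimum-image distance, as in the serial version
        float dx = fabsf(conf[j].x - conf[i].x);
        float dy = fabsf(conf[j].y - conf[i].y);
        float dz = fabsf(conf[j].z - conf[i].z);
        dx -= roundf(dx / L) * L;
        dy -= roundf(dy / L) * L;
        dz -= roundf(dz / L) * L;

        float rij = sqrtf(dx*dx + dy*dy + dz*dz);
        if (rij <= rc)
            E += 4.0f * (powf(rij, -12.0f) - powf(rij, -6.0f));
    }

    // E is still only this thread's partial sum; it has to be reduced
    // over all threads afterwards (see the answers below).
    atomicAdd(E_total, E);
}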


Answer 1:


My main question is how to sum the variable E from each thread and return it to the host?

You will need to sum each thread's result at the block level first, using some form of block-wise parallel reduction (I recommend the CUB block-wise reduction implementation for that).

Once each block has a partial sum from its threads, the block sums need to be combined. This can be done atomically by one thread from each block, by a second kernel call (with one block), or on the host. How and where you will use the final sum will determine which of those options is the most optimal for your application.
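
A minimal sketch of that pattern, using cub::BlockReduce for the per-block sum and an atomicAdd from thread 0 of each block to combine the block sums (it assumes the per-thread energies have already been written to a device array; thread_E, sum_energy and E_total are placeholder names):

#include <cub/cub.cuh>

template <int BLOCK_SIZE>
__global__ void sum_energy(const float *thread_E, int n, float *E_total)
{
    // Block-wide reduction over the BLOCK_SIZE threads of this block.
    typedef cub::BlockReduce<float, BLOCK_SIZE> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float e = (idx < n) ? thread_E[idx] : 0.0f;   // this thread's partial energy

    // The summed result is only valid in thread 0 of the block.
    float block_sum = BlockReduce(temp_storage).Sum(e);

    if (threadIdx.x == 0)
        atomicAdd(E_total, block_sum);            // combine the block sums into one total
}

E_total has to be zeroed (for example with cudaMemset) before the launch. Alternatively, thread 0 can write block_sum to a per-block array and the final sum can be done by a second one-block kernel or on the host, as described above.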




Answer 2:


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>

int main(void)
{
    // Fill a host vector with random numbers and copy it to the device.
    thrust::host_vector<int> h_vec(100);
    std::generate(h_vec.begin(), h_vec.end(), rand);
    thrust::device_vector<int> d_vec = h_vec;

    // Sum all elements on the device with a single reduction call.
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
    std::cout << x << std::endl;
    return 0;
}
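
To apply this to the energy calculation, one option (a sketch, assuming the kernel is modified to write each thread's partial energy into a device buffer instead of accumulating into shared memory; partial_E and total_threads are placeholder names) is to let thrust::reduce do the final sum:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int total_threads = dimGrid.x * dimGrid.y * dimBlock.x * dimBlock.y; // one slot per thread
thrust::device_vector<float> partial_E(total_threads, 0.0f);

// Hypothetical modified kernel that takes a float* output argument:
// calc_energy<<<dimGrid,dimBlock>>>(d_state, 100, 5,
//                                   thrust::raw_pointer_cast(partial_E.data()));

float E = thrust::reduce(partial_E.begin(), partial_E.end(),
                         0.0f, thrust::plus<float>());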


Source: https://stackoverflow.com/questions/50574268/sum-a-variable-over-all-threads-in-a-cuda-kernel-and-return-it-to-host
