Do While don't work inside CUDA Kernel

怎甘沉沦 提交于 2019-12-13 04:37:36

问题


Ok, I'm pretty new into CUDA, and I'm kind of lost, really lost.

I'm trying to calculate pi using the Monte Carlo Method, and at the end I just get one add instead of 50.

I don't want to "do while" for calling the kernel, since it's too slow. My issue is, that my code don't loop, it executes only once in the kernel.

And also, I'd like that all the threads access the same niter and pi, so when some thread hit the counters all the others would stop.

#define SEED 35791246

__shared__ int niter;
__shared__ double pi;

__global__ void calcularPi(){

    double x;
    double y;
    int count;
    double z;

    count = 0;
    niter = 0;

    //keep looping
    do{

        niter = niter + 1;

        //Generate random number
        curandState state;
        curand_init(SEED,(int)niter, 0, &state);
        x = curand(&state);
        y = curand(&state);

        z = x*x+y*y;
        if (z<=1) count++;
     pi =(double)count/niter*4;

    }while(niter < 50);

}

int main(void){

    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();

    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();

    typeof(pi) piFinal;
    cudaMemcpyFromSymbol(&piFinal, "pi", sizeof(piFinal),0, cudaMemcpyDeviceToHost);

    typeof(niter) niterFinal;
    cudaMemcpyFromSymbol(&niterFinal, "niter", sizeof(niterFinal),0, cudaMemcpyDeviceToHost);

    //Ends timer
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", piFinal);
    printf("Adds: %d \n", niterFinal);
    printf("Total time: %f \n", tempoTotal);

}

回答1:


There are a variety of issues with your code.

  1. I suggest using proper cuda error checking and run your code with cuda-memcheck to spot any runtime errors. I've omitted proper error checking in my code below for brevity of presentation, but I've run it with cuda-memcheck to indicate no runtime errors.

  2. Your usage of curand() is probably not correct (it returns integers over a large range). For this code to work correctly, you want a floating-point quantity between 0 and 1. The correct call for that is curand_uniform().

  3. Since you want all threads to work on the same values, you must prevent those threads from stepping on each other. One way to do that is to use atomic updates of the variables in question.

  4. It should not be necessary to re-run curand_init on each iteration. Once per thread should be sufficient.

  5. We don't use cudaMemcpy..Symbol operations on __shared__ variables. For convenience, and to preserve something that resembles your original code, I've elected to convert those to __device__ variables.

Here's a modified version of your code that has most of the above issues fixed:

$ cat t978.cu
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>

#define ITER_MAX 5000
#define SEED 35791246

__device__ int niter;
__device__ int count;

__global__ void calcularPi(){

    double x;
    double y;
    double z;
    int lcount;
    curandState state;
    curand_init(SEED,threadIdx.x, 0, &state);
    //keep looping
    do{

        lcount = atomicAdd(&niter, 1);

        //Generate random number
        x = curand_uniform(&state);
        y = curand_uniform(&state);

        z = x*x+y*y;
        if (z<=1) atomicAdd(&count, 1);

    }while(lcount < ITER_MAX);

}

int main(void){

    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();
    int count_final = 0;
    int niter_final = 0;
    cudaMemcpyToSymbol(niter, &niter_final, sizeof(int));
    cudaMemcpyToSymbol(count, &count_final, sizeof(int));
    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(&count_final, count, sizeof(int));
    cudaMemcpyFromSymbol(&niter_final, niter, sizeof(int));

    //Ends timer
    double pi = count_final/(double)niter_final*4;
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", pi);
    printf("Adds: %d \n", niter_final);
    printf("Total time: %f \n", tempoTotal);

}
$ nvcc -o t978 t978.cu -lcurand
$ cuda-memcheck ./t978
========= CUDA-MEMCHECK
Pi: 3.12083
Adds: 5032
Total time: 0.558463
========= ERROR SUMMARY: 0 errors
$

I've modified the iterations to a larger number, but you can use 50 if you want for ITER_MAX.

Note that there are many criticisms that could be levelled against this code. My aim here, since it's clearly a learning exercise, is to point out what the minimum number of changes could be to get a functional code, using the algorithm you've outlined. As just one example, you might want to change your kernel launch config (<<<1,32>>>) to other, larger numbers, in order to more fully utilize the GPU.



来源:https://stackoverflow.com/questions/33848400/do-while-dont-work-inside-cuda-kernel

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!