CUDA: bad performance with shared memory and no parallelism


I did not really understand what your kernel is supposed to do.

You should read more about CUDA and GPU programming.

I have tried to point out some of the mistakes:

  1. Shared memory (sm) should reduce global memory reads. Analyze your global memory (gm) read and write operations per thread:

    a. You read gm two times and write sm two times.
    b. (Nonsense loop ignored, since its index is never used:) you read sm two times and write sm once.
    c. You read sm once and write gm once.

    So in total you have gained nothing; you could use global memory directly. (See the sketch after this list for a pattern where shared memory actually does pay off.)

  2. You use all threads to write out one value at the block index "i". You should use only one thread to write this data out.
    It makes no sense to output the same data from multiple threads; those writes just get serialized.

  3. You use a loop and don't use the loop counter at all.

  4. You write at "tid" but read at "i", seemingly at random.

  5. This assignment is just overhead:

    unsigned int tid = threadIdx.x;
    
  6. The results cannot be correct with more than one block, since tid = i only holds within a single block!
    All the wrong indexing leads to wrong calculations as soon as more than one block is used.

  7. The shared memory at "i" was never written!

    _memory_device[i] = shared_memory_data[i];
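
For contrast, here is a minimal sketch (kernel and array names are hypothetical, not from your code) of a pattern where shared memory does pay off: a 3-point neighborhood sum, where each value loaded once from global memory is reused by up to three threads.

__global__ void sum3(float* out_g, const float* in_g, const int n)
{
    //sized at launch to (blockDim.x + 2) * sizeof(float)
    extern __shared__ float tile[];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    //one global read per thread; the value is reused by up to three threads
    tile[threadIdx.x + 1] = (i < n) ? in_g[i] : 0.0f;
    if (threadIdx.x == 0)              //left halo
        tile[0] = (i > 0) ? in_g[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) //right halo
        tile[blockDim.x + 1] = (i + 1 < n) ? in_g[i + 1] : 0.0f;

    __syncthreads(); //all loads must be visible before reading neighbors

    if (i < n)
        out_g[i] = tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2];
}

This replaces roughly three global reads per output with one, which is exactly the saving shared memory is meant for.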
    

My assumption of what your kernel is supposed to do:

/*
 * Call kernel with x-block usage and up to 3D Grid
 */
__global__ void bitwiseAnd(int* outData_g, 
    const long long int inSize_s, 
    const int* inData1_g, 
    const int* inData2_g)
{
    //get unique block index
    const unsigned long long int blockId = blockIdx.x //1D
        + blockIdx.y * gridDim.x //2D
        + gridDim.x * gridDim.y * blockIdx.z; //3D

    //get unique thread index
    const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x; 

    //check global unique thread range
    if(threadId >= inSize_s)
        return;

    //output bitwise and
    outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
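
A minimal host-side launch for this kernel could look like the following (the block size and the 1D grid are assumptions; the blockId computation above also accepts 2D and 3D grids):

const unsigned int threadsPerBlock = 256;
const unsigned int blockCount = (unsigned int)((inSize_s + threadsPerBlock - 1) / threadsPerBlock);
bitwiseAnd<<<blockCount, threadsPerBlock>>>(outData_g, inSize_s, inData1_g, inData2_g);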

When you declare an extern __shared__ array, you must also specify its size in the kernel call.

The kernel configuration is:

<<< Dg, Db, Ns, S >>>

Ns is the size in bytes of the extern __shared__ memory per block, and defaults to 0.
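
For example (a sketch; the kernel name and sizes are made up):

__global__ void myKernel(float* data_g)
{
    extern __shared__ float buffer[]; //its size in bytes comes from Ns
    //...
}

//host side: reserve 256 floats of dynamic shared memory per block
myKernel<<<64, 256, 256 * sizeof(float)>>>(data_g);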

I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;          // 128 shorts at the start
    float* array1 = (float*)&array0[128];   // 64 floats after the shorts
    int*   array2 =   (int*)&array1[64];    // ints after the floats
}
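
At launch, Ns then has to cover the total byte size of all packed sub-arrays; for the layout above (taking 32 ints for array2 as an illustrative count):

const size_t ns = 128 * sizeof(short)  //array0
                + 64 * sizeof(float)   //array1
                + 32 * sizeof(int);    //array2
//pass it as the third configuration parameter: kernel<<<Dg, Db, ns>>>(...)

Each sub-array pointer must also be aligned for its element type, which the offsets in this example happen to satisfy.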