CUDA: bad performance with shared memory and no parallelism


I did not really understand what your kernel is supposed to do.

You should read more about CUDA and GPU programming.

I have tried to point out some of the mistakes:

  1. Shared memory (sm) should reduce global memory reads. Analyze your global memory (gm) read and write operations per thread:

    a. You read gm two times and write sm two times.
    b. (Nonsense loop ignored, since its index is never used:) you read sm two times and write sm once.
    c. You read sm once and write gm once.

    So in total you have gained nothing; you could use global memory directly. (See the sketch after this list for a pattern where shared memory actually does pay off.)

  2. You use all threads to write out one value at the block index "i". You should use only one thread to write this data out.
    It makes no sense to output the same data from multiple threads; those writes just get serialized.

  3. You use a loop and don't use the loop counter at all.

  4. You write at "tid" but read at "i", seemingly at random.

  5. This assignment is just overhead:

    unsigned int tid = threadIdx.x;
    
  6. The results cannot be correct with more than one block, since tid = i only holds within a single block!
    All the wrong indexing leads to wrong calculations as soon as more than one block is used.

  7. The shared memory at "i" was never written!

    _memory_device[i] = shared_memory_data[i];
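
For contrast, here is a minimal sketch (kernel and array names are hypothetical, not from your code) of a pattern where shared memory does pay off: a 3-point neighborhood sum, where each value loaded once from global memory is reused by up to three threads.

__global__ void sum3(float* out_g, const float* in_g, const int n)
{
    //sized at launch to (blockDim.x + 2) * sizeof(float)
    extern __shared__ float tile[];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    //one global read per thread; the value is reused by up to three threads
    tile[threadIdx.x + 1] = (i < n) ? in_g[i] : 0.0f;
    if (threadIdx.x == 0)              //left halo
        tile[0] = (i > 0) ? in_g[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) //right halo
        tile[blockDim.x + 1] = (i + 1 < n) ? in_g[i + 1] : 0.0f;

    __syncthreads(); //all loads must be visible before reading neighbors

    if (i < n)
        out_g[i] = tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2];
}

This replaces roughly three global reads per output with one, which is exactly the saving shared memory is meant for.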
    

My assumption of what your kernel is supposed to do:

/*
 * Call kernel with x-block usage and up to 3D Grid
 */
__global__ void bitwiseAnd(int* outData_g, 
    const long long int inSize_s, 
    const int* inData1_g, 
    const int* inData2_g)
{
    //get unique block index
    const unsigned long long int blockId = blockIdx.x //1D
        + blockIdx.y * gridDim.x //2D
        + gridDim.x * gridDim.y * blockIdx.z; //3D

    //get unique thread index
    const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x; 

    //check global unique thread range
    if(threadId >= inSize_s)
        return;

    //output bitwise and
    outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
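
A minimal host-side launch for this kernel could look like the following (the block size and the 1D grid are assumptions; the blockId computation above also accepts 2D and 3D grids):

const unsigned int threadsPerBlock = 256;
const unsigned int blockCount = (unsigned int)((inSize_s + threadsPerBlock - 1) / threadsPerBlock);
bitwiseAnd<<<blockCount, threadsPerBlock>>>(outData_g, inSize_s, inData1_g, inData2_g);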

When you declare an extern __shared__ array, you must also specify its size in the kernel call.

The kernel configuration is:

<<< Dg, Db, Ns, S >>>

Ns is the size in bytes of the extern __shared__ memory per block, and defaults to 0.
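
For example (a sketch; the kernel name and sizes are made up):

__global__ void myKernel(float* data_g)
{
    extern __shared__ float buffer[]; //its size in bytes comes from Ns
    //...
}

//host side: reserve 256 floats of dynamic shared memory per block
myKernel<<<64, 256, 256 * sizeof(float)>>>(data_g);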

I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;          // 128 shorts at the start
    float* array1 = (float*)&array0[128];   // 64 floats after the shorts
    int*   array2 =   (int*)&array1[64];    // ints after the floats
}
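
At launch, Ns then has to cover the total byte size of all packed sub-arrays; for the layout above (taking 32 ints for array2 as an illustrative count):

const size_t ns = 128 * sizeof(short)  //array0
                + 64 * sizeof(float)   //array1
                + 32 * sizeof(int);    //array2
//pass it as the third configuration parameter: kernel<<<Dg, Db, ns>>>(...)

Each sub-array pointer must also be aligned for its element type, which the offsets in this example happen to satisfy.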