Analyzing memory access coalescing of my CUDA kernel

前端未结

关注

 3  1914

野性不改 2020-12-10 17:07

I would like to read (BS_X+1)*(BS_Y+1) global memory locations by BS_x*BS_Y threads moving the contents to the shared memory and I have developed the following code.

3条回答

甜味超标 (楼主)

2020-12-10 17:24

I've evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows the code less exploits the coalescing. Here is the coalescing analyzer that you may find useful:

#include 
#include 

typedef struct dim3_t{
    int x;
    int y;
} dim3;


// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16


// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size)  (addr*size)&(~(CACHEBLOCKSIZE-1))


int main(){
    // fixed dim3 variables
    // grid and block size
    dim3 blockDim,gridDim;
    blockDim.x=BLOCKDIMX;
    blockDim.y=BLOCKDIMY;
    gridDim.x=GRIDDIMX;
    gridDim.y=GRIDDIMY;

    // counters
    int unq_accesses=0;
    int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
    int total_unq_accesses=0;

    // iter over total number of threads
    // and count the number of memory requests (the coalesced requests)
    int I, II, III;
    for(I=0; I



The code will run for every thread of the grid and calculates the number of merged requests, metric of memory access coalescing.

To use the analyzer, paste the index calculation portion of your code in the specified region and decompose the memory accesses (array) into 'address' and 'size'. I've already done this for your code where the indexings are:

int i       = threadIdx.x;
int j       = threadIdx.y;
int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1  = j*BLOCK_SIZE_Y+i;

int i1      = (index1)%(BLOCK_SIZE_X+1);
int j1      = (index1)/(BLOCK_SIZE_Y+1);

int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);


and the memory access is:

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];


The analyzer reports 4096 threads access to 4064 cache blocks. Run the code for your actual grid and block size and analyze the coalescing behavior.