Non-square matrix transpose with shared mem in CUDA

Submitted by 回眸只為那壹抹淺笑 on 2019-12-05 22:07:05

The problem was that some threads read undefined values, because col_2 and row_2 were being assigned inside an if() statement that not all threads entered.

To fix this, we can assign col_2 and row_2 at the point where the variables are declared, and delete the duplicate computation that took place inside the aforementioned if():

__shared__ double tile[16*(16+1)];   // 16x16 tile, padded by one column to avoid bank conflicts

// Global coordinates for the load
int col = threadIdx.x + blockIdx.x * blockDim.x;
int row = threadIdx.y + blockIdx.y * blockDim.y;

// Coordinates for the transposed store: block indices swapped,
// computed unconditionally so every thread has a defined value
int col_2 = blockIdx.y * blockDim.y + threadIdx.x;
int row_2 = blockIdx.x * blockDim.x + threadIdx.y;

int a_cols = tab_rows - a_rows;
int tab_cols = 2*tab_rows + 2;

Thus, the rest of the code looks like this:

if( (col<a_cols) && (row<a_rows) ) 
{
    // Load the data into shared mem
    tile[threadIdx.x+threadIdx.y*(16+1)]=a[IDX2L(row,col,a_cols)];
    // Normal copy (+ offsets)
    tab[IDX2L(row,col+tab_rows+a_rows,tab_cols)]= tile[threadIdx.x+threadIdx.y*(16+1)];
}
__syncthreads();

if( (row_2<a_cols) && (col_2<a_rows) )
    // Transpose (+ other offsets)
    tab[IDX2L(row_2+a_rows,col_2+tab_rows,tab_cols)]= -tile[threadIdx.y+threadIdx.x*(16+1)];
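To see why the swapped indexing above yields a transpose, here is a minimal CPU sketch of the same index arithmetic for a single 16x16 block (block offsets and the tableau offsets dropped for clarity). It assumes IDX2L is a row-major index macro, i*ld + j, which matches how the answer indexes a with its column count as the leading dimension; if your IDX2L is column-major, swap the arguments accordingly:

```cpp
#include <cassert>

// Assumed row-major index macro: element (i, j) with leading dimension ld.
#define IDX2L(i, j, ld) ((i) * (ld) + (j))

// CPU re-enactment of the tiled transpose for one 16x16 block.
// `tile` plays the role of the shared-memory buffer; the +1 padding
// column is what avoids shared-memory bank conflicts on a real GPU.
void transpose_tile(const double* a, double* at, int a_rows, int a_cols) {
    double tile[16 * (16 + 1)];

    // "Load" phase: each (tx, ty) pair mimics one thread of the block.
    for (int ty = 0; ty < 16; ++ty)
        for (int tx = 0; tx < 16; ++tx) {
            int col = tx, row = ty;               // blockIdx = (0, 0)
            if (col < a_cols && row < a_rows)
                tile[tx + ty * (16 + 1)] = a[IDX2L(row, col, a_cols)];
        }

    // "Store" phase: thread coordinates swapped, as with col_2/row_2
    // above, and the tile is read with threadIdx.x/y exchanged.
    for (int ty = 0; ty < 16; ++ty)
        for (int tx = 0; tx < 16; ++tx) {
            int col_2 = tx, row_2 = ty;
            if (row_2 < a_cols && col_2 < a_rows)
                at[IDX2L(row_2, col_2, a_rows)] = tile[ty + tx * (16 + 1)];
        }
}
```

Because tile[ty + tx*(16+1)] was written during the load phase by the "thread" at (tx'=ty, ty'=tx), the store lands a(col_2, row_2) at at(row_2, col_2), which is exactly the transpose. In the real kernel the __syncthreads() between the two phases is what makes this cross-thread read safe, which is why it must be reached by all threads, outside any divergent if().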