Shared memory matrix multiplication kernel
Question

I am attempting to implement a shared-memory-based matrix multiplication kernel as outlined in the CUDA C Programming Guide. The following is the kernel:

    __global__ void matrixMultiplyShared(float * A, float * B, float * C,
                                         int ARows, int AColumns,
                                         int BRows, int BColumns,
                                         int CRows, int CColumns) {
        float * CSub = &C[CColumns * 16 * blockIdx.y + 16 * blockIdx.x];
        float CValue = 0;
        for (int k = 0; k < (AColumns / 16); ++k) {
            float * ASub = &A[AColumns * 16 * blockIdx.y + 16 * k];
            float * BSub
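
The index arithmetic visible in the kernel can be checked on the host before debugging on the device. The following is a minimal sketch in plain C, not part of the original post: the names `TILE_WIDTH`, `tile_offset`, `full_tiles`, and `covering_tiles` are hypothetical helpers introduced only for illustration. It assumes row-major storage and the 16x16 tile size implied by the literal `16` in the kernel. It also shows that the loop bound `AColumns / 16` uses integer division, so it counts only complete tiles; if `AColumns` is not a multiple of 16, the trailing partial tile would not be visited by a loop written this way.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: tile edge of 16, matching the literal 16 in the kernel. */
#define TILE_WIDTH 16

/* Offset of the top-left element of tile (tileRow, tileCol) in a row-major
 * matrix with `columns` columns. Mirrors the kernel's expression
 * columns * 16 * blockIdx.y + 16 * blockIdx.x. (Hypothetical helper.) */
static size_t tile_offset(int columns, int tileRow, int tileCol) {
    assert(columns >= 0 && tileRow >= 0 && tileCol >= 0);
    return (size_t)columns * TILE_WIDTH * (size_t)tileRow
         + (size_t)TILE_WIDTH * (size_t)tileCol;
}

/* Number of complete tiles along a dimension of length n.
 * This is what the loop bound AColumns / 16 evaluates to. */
static int full_tiles(int n) {
    assert(n >= 0);
    return n / TILE_WIDTH;
}

/* Number of tiles needed to cover n elements, counting a trailing
 * partial tile (ceiling division). */
static int covering_tiles(int n) {
    assert(n >= 0);
    return (n + TILE_WIDTH - 1) / TILE_WIDTH;
}
```

For a dimension of 32 both counts agree, while for a dimension of 40 `full_tiles` gives 2 and `covering_tiles` gives 3. Whether that difference matters here depends on the matrix sizes being used and on the part of the kernel that is not shown.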