Matrix Transpose (with Shared Memory) of Arbitrary Size in CUDA C
Question
I can't figure out how to transpose a non-square matrix using shared memory in CUDA C. (I am new to both CUDA C and C.) The NVIDIA blog post at https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/ shows an efficient way to transpose a matrix (Coalesced Transpose Via Shared Memory), but it only works for square matrices. The code is also provided on GitHub (the same as on the blog). There is a similar question on Stack Overflow, where TILE_DIM = 16 is set. But with that implementation every