Fast memory transpose with SSE, AVX, and OpenMP
Question

I need a fast memory transpose algorithm for my Gaussian convolution function in C/C++. What I do now is:

    convolute_1D
    transpose
    convolute_1D
    transpose

It turns out that with this method the filter size has to be large (larger than I expected), or else the transpose takes longer than the convolution. For example, for a 1920x1080 matrix the convolution takes the same time as the transpose for a filter size of 35. The current transpose algorithm I am using uses loop blocking/tiling along with SSE and