I need a fast memory transpose algorithm for my Gaussian convolution function in C/C++. What I do now is:
convolute_1D
transpose
convolute_1D
transpose
I'd guess that your best bet is to combine the convolution and the transpose, i.e. write out the results of each convolution transposed as you go. You're almost certainly memory-bandwidth limited on the transpose, so reducing the number of instructions used for the transpose isn't really going to help (hence the lack of improvement from using AVX). Reducing the number of passes over your data is what will give you the biggest performance improvement.
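To make the idea concrete, here is a minimal, unoptimised sketch of a 1D convolution that writes its output transposed. The function name `convolve_rows_transposed`, the row-major `std::vector<float>` layout, and the clamped border handling are all assumptions of mine, not something taken from your code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: convolves every row of a row-major image
// (width x height) with an odd-length 1D kernel and stores the result
// transposed, so the value for input pixel (x, y) lands at dst[x * height + y].
// Borders are handled by clamping the sample index (an assumption).
// src and dst must be different vectors.
void convolve_rows_transposed(const std::vector<float>& src,
                              std::vector<float>& dst,
                              int width, int height,
                              const std::vector<float>& kernel)
{
    assert(width > 0 && height > 0);
    assert(kernel.size() % 2 == 1);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(&src != &dst);

    const int radius = static_cast<int>(kernel.size()) / 2;
    dst.resize(src.size());

    for (int y = 0; y < height; ++y) {
        const float* row = &src[static_cast<std::size_t>(y) * width];
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                // Clamp the sample position to the valid range of the row.
                const int xs = std::min(std::max(x + k, 0), width - 1);
                sum += kernel[k + radius] * row[xs];
            }
            // Transposed store: column x of the input becomes row x of dst.
            dst[static_cast<std::size_t>(x) * height + y] = sum;
        }
    }
}
```

Calling it twice, first with `(width, height)` and then with `(height, width)` on the intermediate buffer, should give the full separable 2D convolution back in the original orientation, with two passes over the data instead of four. The stores to `dst` are strided, so for large images you would probably want to process a block of rows at a time to make better use of each cache line; I haven't measured that here.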