Fast memory transpose with SSE, AVX, and OpenMP

清歌不尽 2020-12-30 08:22

I need a fast memory transpose algorithm for my Gaussian convolution function in C/C++. What I do now is:

convolute_1D
transpose
convolute_1D
transpose
        
3 answers
  •  时光取名叫无心
    2020-12-30 09:07

    I'd guess that your best bet would be to combine the convolution and the transpose, i.e. write the results of the convolution out transposed as you go. The transpose is almost certainly memory-bandwidth limited, so reducing the number of instructions it uses isn't really going to help (hence the lack of improvement from using AVX). Reducing the number of passes over your data is what will give you the best performance improvement.
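
As a rough illustration of the idea, here is a minimal scalar sketch in C that convolves each row and stores the result directly in transposed position, so the four passes (convolve, transpose, convolve, transpose) collapse into two. The function names `convolve_rows_transposed` and `gaussian_blur_2d`, the `float` row-major layout, the clamped borders, and the separate scratch buffer are all assumptions made for the example, not details from the question.

```c
#include <stddef.h>

/* Hypothetical helper: filter every row of src (height x width, row-major)
 * with a 1D kernel of odd length klen and write the result TRANSPOSED into
 * dst (width x height, row-major). Borders are clamped to the edge pixel
 * (an assumption). The kernel is applied without flipping, which equals a
 * convolution for a symmetric kernel such as a Gaussian. */
static void convolve_rows_transposed(const float *src, float *dst,
                                     size_t width, size_t height,
                                     const float *kernel, size_t klen)
{
    const ptrdiff_t radius = (ptrdiff_t)(klen / 2);
    for (size_t y = 0; y < height; ++y) {
        const float *row = src + y * width;
        for (size_t x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (size_t k = 0; k < klen; ++k) {
                ptrdiff_t xi = (ptrdiff_t)x + (ptrdiff_t)k - radius;
                if (xi < 0) xi = 0;
                if (xi >= (ptrdiff_t)width) xi = (ptrdiff_t)width - 1;
                acc += row[xi] * kernel[k];
            }
            /* Transposed store: output pixel (x, y) lands at row x, column y. */
            dst[x * height + y] = acc;
        }
    }
}

/* Hypothetical driver: separable 2D blur in two fused passes.
 * scratch must hold width * height floats; the result ends up in image. */
static void gaussian_blur_2d(float *image, float *scratch,
                             size_t width, size_t height,
                             const float *kernel, size_t klen)
{
    /* Pass 1: filter along x, result stored as a width x height image. */
    convolve_rows_transposed(image, scratch, width, height, kernel, klen);
    /* Pass 2: filter along the original y, transposed back to the input layout. */
    convolve_rows_transposed(scratch, image, height, width, kernel, klen);
}
```

The transposed store still strides through `dst` by `height` elements, so for large images it would probably pay to process a small block of rows at a time so that each destination cache line is filled before it is evicted. The outer loop over `y` is also the natural place for an OpenMP `parallel for`, since each row writes to a distinct column of the output.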
