CUDA Device To Device transfer expensive

情歌与酒 2020-12-18 08:31

I have written some code to swap the quadrants of a 2D matrix, stored in a flat array, for FFT purposes.

    int leftover = W-dcW;

    T *temp;

2 Answers
  • 2020-12-18 08:47

    I ended up writing a kernel to do the swaps. This was indeed faster than the device-to-device memcpy operations.
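The answer does not include its kernel, so the following is only a minimal sketch of what an in-place quadrant swap could look like, not the poster's actual code. It assumes a row-major flat array with even height `H` and width `W`; the names `swap_quadrants` and `swap_matches_reference` are hypothetical. Each thread owns one element of the top half and exchanges it with its diagonal partner in the bottom half, so every pair is touched exactly once and no temporary buffer is needed.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: swaps top-left <-> bottom-right and
// top-right <-> bottom-left quadrants in place.
// Assumption: H and W are both even.
template <typename T>
__global__ void swap_quadrants(T *data, int H, int W)
{
    int i = threadIdx.y + blockDim.y * blockIdx.y;  // row in the top half
    int j = threadIdx.x + blockDim.x * blockIdx.x;  // column

    if (i < H / 2 && j < W) {
        int pi = i + H / 2;        // partner row in the bottom half
        int pj = (j + W / 2) % W;  // partner column, shifted by half the width
        T tmp = data[i * W + j];
        data[i * W + j]   = data[pi * W + pj];
        data[pi * W + pj] = tmp;
    }
}

// Hypothetical host helper: runs the kernel on a test matrix and
// compares the result with a CPU reference shift.
bool swap_matches_reference(int H, int W)
{
    assert(H % 2 == 0 && W % 2 == 0);

    std::vector<float> h(H * W), ref(H * W);
    for (int k = 0; k < H * W; ++k) h[k] = (float)k;
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            ref[((i + H / 2) % H) * W + (j + W / 2) % W] = h[i * W + j];

    float *d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    // Only H/2 rows of threads are needed: one thread per swapped pair.
    dim3 block(16, 16);
    dim3 grid((W + block.x - 1) / block.x, (H / 2 + block.y - 1) / block.y);
    swap_quadrants<<<grid, block>>>(d, H, W);
    bool launched = (cudaGetLastError() == cudaSuccess);

    cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return launched && err == cudaSuccess && h == ref;
}
```

A single launch like this replaces the sequence of device-to-device copies through a temporary buffer, which is one plausible reason the kernel version turned out faster.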

  • 2020-12-18 08:51

    Perhaps the following solution to perform the 2d fftshift in CUDA would be of interest:

    #define IDX2R(i,j,N) (((i)*(N))+(j))
    
    __global__ void fftshift_2D(double2 *data, int N1, int N2)
    {
        int i = threadIdx.y + blockDim.y * blockIdx.y;
        int j = threadIdx.x + blockDim.x * blockIdx.x;
    
        if (i < N1 && j < N2) {
            double a = pow(-1.0, (i+j)&1);
    
            data[IDX2R(i,j,N2)].x *= a;
            data[IDX2R(i,j,N2)].y *= a;
        }
    }
    

    The kernel multiplies the matrix to be transformed by a checkerboard of 1s and -1s. This is equivalent to multiplying by exp(-j*(n+m)*pi), and therefore to a shift by half the size along both dimensions in the conjugate domain (for even N1 and N2).
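The equivalence can be checked with the DFT shift property. The derivation below is a sketch that assumes even $N_1$ and $N_2$ and the usual forward-DFT sign convention; the symbols $x$, $y$, $X$, $Y$, $k$, $l$ are introduced here for illustration only.

```latex
\begin{aligned}
X[k,l] &= \sum_{n=0}^{N_1-1}\sum_{m=0}^{N_2-1} x[n,m]\,
          e^{-j2\pi\left(\frac{kn}{N_1}+\frac{lm}{N_2}\right)},\\
y[n,m] &= (-1)^{n+m}\,x[n,m] = e^{-j\pi(n+m)}\,x[n,m]
        = e^{\,j2\pi\left(\frac{(N_1/2)\,n}{N_1}+\frac{(N_2/2)\,m}{N_2}\right)}\,x[n,m],\\
Y[k,l] &= \sum_{n=0}^{N_1-1}\sum_{m=0}^{N_2-1} x[n,m]\,
          e^{-j2\pi\left(\frac{(k-N_1/2)\,n}{N_1}+\frac{(l-N_2/2)\,m}{N_2}\right)}
        = X\!\left[(k-\tfrac{N_1}{2}) \bmod N_1,\ (l-\tfrac{N_2}{2}) \bmod N_2\right].
\end{aligned}
```

The second line uses $e^{-j\pi(n+m)} = e^{\,j\pi(n+m)}$ for integer $n+m$. Since a circular shift by $-N/2$ equals a shift by $+N/2$ when $N$ is even, $Y$ is exactly the fftshift of $X$.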

    You have to call this kernel both before and after executing the cuFFT transform.
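As an illustration of that call order, here is a hedged sketch of a wrapper around an in-place double-precision 2D transform. The wrapper name `centered_fft2` and the 16x16 block size are assumptions made for this example; `cufftPlan2d`, `cufftExecZ2Z` and `cufftDestroy` are the standard cuFFT calls, and the kernel is repeated (with the integer form of the sign) so the block is self-contained. Link against cuFFT (`-lcufft`).

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

#define IDX2R(i,j,N) (((i)*(N))+(j))

// Checkerboard sign flip, as in the answer, using the pow-free form.
__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.y + blockDim.y * blockIdx.y;
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < N1 && j < N2) {
        double a = 1 - 2 * ((i + j) & 1);  // +1 on even (i+j), -1 on odd

        data[IDX2R(i,j,N2)].x *= a;
        data[IDX2R(i,j,N2)].y *= a;
    }
}

// Hypothetical wrapper: shift, transform in place, shift again.
// Assumes d_data is a device pointer to N1 x N2 row-major values
// and that N1 and N2 are even.
bool centered_fft2(double2 *d_data, int N1, int N2)
{
    dim3 block(16, 16);
    dim3 grid((N2 + block.x - 1) / block.x, (N1 + block.y - 1) / block.y);

    cufftHandle plan;
    if (cufftPlan2d(&plan, N1, N2, CUFFT_Z2Z) != CUFFT_SUCCESS) return false;

    fftshift_2D<<<grid, block>>>(d_data, N1, N2);                    // before
    cufftResult r = cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    fftshift_2D<<<grid, block>>>(d_data, N1, N2);                    // after

    cufftDestroy(plan);
    return r == CUFFT_SUCCESS && cudaDeviceSynchronize() == cudaSuccess;
}
```

All three steps run on the default stream, so they execute in order without extra synchronization between them.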

    One advantage is that no memory movement or swapping is needed.

    IMPROVEMENT IN SPEED

    Following a suggestion received on the NVIDIA Forum, speed can be improved by changing the instruction

    double a = pow(-1.0,(i+j)&1);
    

    to

    double a = 1-2*((i+j)&1);
    

    to avoid the use of the slow routine pow.
