Question
ORIGINAL QUESTION
I have the following kernel performing an interpolation with nonuniform node points, and I would like to optimize it:
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P;
const double alfa=(2.-1./cc)*pi_double-0.01;
double phi_cap_s;
cufftDoubleComplex temp;
double cc_points=cc*points[i];
double r_cc_points=rint(cc*points[i]);
temp = make_cuDoubleComplex(0.,0.);
if(i<M) {
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj[PP].x;
temp.y = temp.y+phi_cap_s*Uj[PP].y;
}
result[i] = temp;
}
}
K and cc are constants, points contains the nodes, and Uj contains the values to be interpolated. modulo is a function that basically works like %, but properly extended to negative values (e.g., modulo(-3, 128) returns 125, whereas -3 % 128 is -3 in C). For a certain arrangement, the kernel call takes 2.3ms. I have verified that the most expensive parts are
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
which takes about 40% of the total time, and
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj[PP].x;
temp.y = temp.y+phi_cap_s*Uj[PP].y;
which takes about 60%. With the Visual Profiler, I have verified that the performance of the former is not influenced by the presence of the if statements. Please note that I want double precision, so I'm avoiding the __expf() solution. I suspect that, for the latter, the "random" memory access Uj[PP] could be responsible for that large share of the time. Any suggestions on tricks or comments to reduce the computation time? Thanks in advance.
VERSION FOLLOWING COMMENTS AND ANSWERS
Following the suggestions kindly provided in the answers and comments, I ended up with the code below:
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P,tempd;
const double alfa=(2.-1./cc)*pi_double-0.01;
cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);
double cc_points=cc*points[i];
double r_cc_points=rint(cc_points);
cufftDoubleComplex rtemp[(2*K+1)];
double phi_cap_s[2*K+1];
if(i<M) {
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = modulo(((int)r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd); }
else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x;
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
result[i] = temp;
}
}
In particular: 1) I moved the global memory variable Uj into the register array rtemp of size 2*K+1 (K is a constant equal to 6 in my case); 2) I moved the variable phi_cap_s into a 2*K+1-sized register array; 3) I used if ... else statements instead of the three previously used if's (the conditions P<0. and P>0. have the same occurrence probability); 4) I defined extra variables for the square root; 5) I used rsqrt instead of sqrt (as far as I know, sqrt() is calculated by CUDA as 1/rsqrt()).
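(For reference, the rewrite in 4)-5) is exact: with tempd = rsqrt(P) = 1/sqrt(P), the expression sinh(alfa*sqrt(P))/sqrt(P) equals sinh(alfa/tempd)*tempd, and similarly sin(alfa*sqrt(-P))/sqrt(-P) equals sin(alfa/tempd)*tempd with tempd = rsqrt(-P), so the computed result is unchanged.)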
I added each new feature one at a time, verifying the improvement against the original version, but I must say that none of them gave me any significant improvement.
The execution speed is limited by: 1) the calculation of the sin/sinh functions (about 40% of the time); is there any way to calculate them in double precision arithmetic by somehow exploiting intrinsic math as a "starting guess"? 2) the fact that many threads end up accessing the same global memory locations Uj[PP] due to the mapping index PP; one possibility to avoid this would be using shared memory, but this would imply strong thread cooperation.
My question is: am I done? Namely, is there any way to improve the code further? I profiled the code with the NVIDIA Visual Profiler and here are the results:
IPC = 1.939 (compute capability 2.1);
Global Memory Load Efficiency = 38.9%;
Global Memory Store Efficiency = 18.8%;
Warp Execution Efficiency = 97%;
Instruction Replay Overhead = 0.7%;
Finally, I would like to note that this discussion is linked to the discussion at CUDA: 1-dimensional cubic spline interpolation in CUDA.
VERSION USING SHARED MEMORY
I have made a feasibility study on using shared memory. I have considered N=64 so that the whole Uj fits into shared memory. Below is the code (it is basically my original version):
__global__ void interpolation_shared(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P;
const double alfa=(2.-1./cc)*pi_double-0.01;
double phi_cap_s;
cufftDoubleComplex temp;
double cc_points=cc*points[i];
double r_cc_points=rint(cc*points[i]);
temp = make_cuDoubleComplex(0.,0.);
__shared__ cufftDoubleComplex Uj_shared[128];
if (threadIdx.x < cc*N) Uj_shared[threadIdx.x]=Uj[threadIdx.x];
__syncthreads(); // make the complete shared copy of Uj visible to all threads of the block
if(i<M) {
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj_shared[PP].x;
temp.y = temp.y+phi_cap_s*Uj_shared[PP].y;
}
result[i] = temp;
}
}
The result again does not improve significantly, although this might depend on the small size of the input array.
VERBOSE PTXAS OUTPUT
ptxas : info : Compiling entry function '_Z13interpolationP7double2PdS0_ii' for 'sm_20'
ptxas : info : Function properties for _Z13interpolationP7double2PdS0_ii
352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 55 registers, 456 bytes cumulative stack size, 52 bytes cmem[0]
VALUES OF P, FOR FIRST WARP AND m=0
0.0124300933082964
0.0127183892149176
0.0135847002913749
0.0161796378170038
0.0155488126345702
0.0138890822153499
0.0121163187739057
0.0119998374528905
0.0131600831194518
0.0109574866163769
0.00962949548477354
0.00695850974164358
0.00446426651940612
0.00423369284281705
0.00632921297092537
0.00655137618976198
0.00810202954519923
0.00597974034698723
0.0076811348379735
0.00604267951733561
0.00402922460255439
0.00111841719893846
-0.00180949615796777
-0.00246283218698551
-0.00183256444286428
-0.000462696661685413
0.000725108980390132
-0.00126793006072035
0.00152263101649197
0.0022499598348702
0.00463681632275836
0.00359856091027666
MODULO FUNCTION
__device__ int modulo(int val, int modulus)
{
if(val > 0) return val%modulus;
else
{
int P = (-val)%modulus;
if(P > 0) return modulus -P;
else return 0;
}
}
MODULO FUNCTION OPTIMIZED ACCORDING TO ANSWER
__device__ int modulo(int val, int _mod)
{
if(val > 0) return val&(_mod-1);
else
{
int P = (-val)&(_mod-1);
if(P > 0) return _mod -P;
else return 0;
}
}
Answer 1:
//your code above
cufftDoubleComplex rtemp[(2*K+1)]; //if it fits into available registers; assumes K is a constant
if(i<M) {
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = modulo((r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
}
#pragma unroll
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
// 1
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
else if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
else phi_cap_s = alfa/pi_double;
temp.x = temp.x+phi_cap_s*rtemp[m].x; //3
temp.y = temp.y+phi_cap_s*rtemp[m].y;
}
result[i] = temp;
}
Explanation
Added else if and else since these conditions are mutually exclusive; if you can, you should order the statements by probability of occurrence. E.g., if P<0. holds most of the time, you should evaluate that case first.
This fetches the requested memory into multiple registers; what you did before almost certainly caused the thread to stall because the memory was not available in time for the calculation. Keep in mind that if one thread stalls in a warp, the whole warp stalls. If there are not enough warps in the ready queue, execution will stall until some warp is ready.
We have now moved the calculations further forward in time in order to compensate for the bad memory access; hopefully the work done in between hides the bad access pattern.
The reason why this should work is the following:
A request for memory that has to come from GMEM costs roughly 400-600 cycles. If a thread tries to perform operations on memory that is not yet available, it will stall. That means that if a memory request does not hit in L1/L2, each warp has to wait that time or more until it can continue.
What I suspect is that temp.x+phi_cap_s*Uj[PP].x is doing just that. By staging (step 2) each memory transfer into a register, and moving on to stage the next one, you hide the latency by giving yourself other work to do while the memory is transferred.
By the time you reach step 3, the memory is hopefully available, or you have to wait less time for it.
If rtemp does not fit into registers while still achieving 100% occupancy, you may have to do it in batches.
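Just to make "in batches" concrete, here is a rough sketch (mine, not benchmarked; BATCH, rtile and Ptile are names I introduce, while cc, K, N, alfa, pi_double, modulo, Uj, cc_points, r_cc_points and temp are assumed to be defined as in the kernels above). It would replace the two unrolled loops inside if(i<M):
const int BATCH = 4; // hypothetical tile size; tune so one tile of staged loads fits comfortably in registers
for(int base=0; base<(2*K+1); base+=BATCH) {
    cufftDoubleComplex rtile[BATCH];
    double Ptile[BATCH];
    #pragma unroll
    for(int b=0; b<BATCH; b++) {          // stage a small tile of global loads
        int m = base+b;
        if(m<(2*K+1)) {
            int PP = modulo((int)r_cc_points + m - K, cc*N);
            rtile[b] = Uj[PP];
            double d = cc_points - (r_cc_points + (double)(m-K));
            Ptile[b] = K*K - d*d;         // keep P for the second pass
        }
    }
    #pragma unroll
    for(int b=0; b<BATCH; b++) {          // consume the tile while the next loads can be issued
        int m = base+b;
        if(m<(2*K+1)) {
            double P = Ptile[b], phi_cap_s, tempd;
            if(P>0.)      { tempd = rsqrt(P);  phi_cap_s = (1./pi_double)*sinh(alfa/tempd)*tempd; }
            else if(P<0.) { tempd = rsqrt(-P); phi_cap_s = (1./pi_double)*sin(alfa/tempd)*tempd; }
            else          phi_cap_s = alfa/pi_double;
            temp.x = temp.x + phi_cap_s*rtile[b].x;
            temp.y = temp.y + phi_cap_s*rtile[b].y;
        }
    }
}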
You could also try to make phi_cap_s into an array and put it into the first loop like this:
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
//stage memory first
PP = modulo((r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
// 1
if(P>0.) phi_cap_s[m] = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
else if(P<0.) phi_cap_s[m] = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; //3
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
Edit
Expression
P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
Can be broken down into:
const double cc_diff = cc_points-r_cc_points;
double exp = cc_diff - (double)(m-K);
exp *= exp;
P = (K*K-exp);
Which may reduce the number of instructions used.
Edit 2
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P,tempd;
cufftDoubleComplex rtemp[(2*K+1)];
double phi_cap_s[2*K+1];
if(i<M) {
const double cc_points=cc*points[i];
cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);
const double alfa=(2.-1./cc)*pi_double-0.01;
const double r_cc_points=rint(cc_points);
const double cc_diff = cc_points-r_cc_points;
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = m-K; //reuse PP
double exp = cc_diff - (double)(PP); //stage exp to be used later, will explain
PP = modulo(((int)r_cc_points + PP ),(cc*N));
rtemp[m] = Uj[PP]; //2
exp *= exp;
P = (K*K-exp);
if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd); }
else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x;
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
result[i] = temp;
}
}
What I have done is move all the calculations inside the if statement, to free up some resources both in terms of calculation and memory fetches; I do not know how much divergence you have on the first if statement, if(i<M). Since m-K appeared twice in the code, I first put it in PP so that it can be reused when you calculate exp and PP.
What else you can do is try to order your instructions so that, after you set a variable, as many instructions as possible come between that write and the next use of the variable, since it takes ~20 cycles for the value to land in the register. Hence, I put the constant cc_diff at the top; however, as this is only a single instruction, it may not show any benefit.
Modulo function
__device__ int modulo(int val, int _mod) {
// valid only because _mod is always a power of 2;
// with two's-complement integers the mask alone already yields the
// non-negative residue, even for negative val
return val&(_mod-1);
}
As _mod is always an integer power of 2 (cc = 2, N = 64, cc*N = 128), we can use this function instead of the mod operator. This should be "much" faster. Check it, though, to make sure I have the arithmetic right. It is from Optimizing Cuda - Part II, Nvidia, page 14.
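As one way to "check it" on the host (this quick test is my addition, not part of the original post; modulo_ref and modulo_pow2 are names I made up), compare the bitmask version against the %-based modulo from the question:
#include <cstdio>
#include <cassert>

// %-based reference, copied from the question
int modulo_ref(int val, int modulus)
{
    if(val > 0) return val%modulus;
    else {
        int P = (-val)%modulus;
        if(P > 0) return modulus - P;
        else return 0;
    }
}

// bitmask version; valid only when _mod is a power of 2
int modulo_pow2(int val, int _mod)
{
    return val & (_mod - 1);
}

int main()
{
    const int _mod = 128; // cc*N = 2*64 in the question
    for(int val = -1000; val <= 1000; val++)
        assert(modulo_pow2(val, _mod) == modulo_ref(val, _mod));
    printf("bitmask modulo matches the %%-based version for _mod = %d\n", _mod);
    return 0;
}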
Answer 2:
One optimization you would probably want to look into is using fast math. Use intrinsic math functions and compile with the --use_fast_math option.
intrinsics math
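To make this concrete, a rough sketch (mine; phi_cap_fast is just an illustrative name): CUDA provides fast single-precision intrinsics such as __sinf() and __expf(), but no double-precision counterparts, so this trades accuracy for speed and does not directly satisfy the double-precision requirement stated in the question.
// Single-precision only: __sinf() is a fast hardware sine approximation,
// and sinh has no intrinsic, so it is formed here from __expf().
__device__ float phi_cap_fast(float alfa_f, float P)
{
    const float inv_pi = 0.318309886f;                        // 1/pi
    if (P > 0.f) {
        float s  = sqrtf(P);
        float sh = 0.5f*(__expf(alfa_f*s) - __expf(-alfa_f*s)); // sinh via __expf
        return inv_pi * sh / s;
    } else if (P < 0.f) {
        float s = sqrtf(-P);
        return inv_pi * __sinf(alfa_f*s) / s;
    }
    return inv_pi * alfa_f;
}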
Source: https://stackoverflow.com/questions/13941872/optimizing-cuda-kernel-interpolation-with-nonuniform-node-points