Optimize for fast multiplication but slow addition: FMA and double-double

独厮守ぢ asked on 2020-11-30 13:51

When I first got a Haswell processor, I tried using FMA to compute the Mandelbrot set. The main algorithm is this:

int n = 0;
for (int32_t i = 0; i < maxiter; i++) {       // maxiter: iteration cap (name assumed)
    /* x = x*x - y*y + cx;  y = 2*x*y + cy;  stop once |z|^2 exceeds the escape radius */
}
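
A minimal scalar sketch of this escape-time iteration, for reference (the names maxiter, cut, cx and cy are placeholders; the vectorized loop above is abbreviated):

#include <stdint.h>

/* Count iterations of z = z^2 + c until |z|^2 = x^2 + y^2 exceeds the
   squared escape radius `cut`, or the cap `maxiter` is reached. */
static int32_t mandel_iters(float cx, float cy, int32_t maxiter, float cut)
{
    float x = 0.0f, y = 0.0f;
    int32_t n = 0;
    for (int32_t i = 0; i < maxiter; i++) {
        float x2 = x * x, y2 = y * y;
        if (x2 + y2 > cut)
            break;                    /* the point escaped */
        float t = x * y;
        x = x2 - y2 + cx;             /* Re(z^2 + c) */
        y = 2.0f * t + cy;            /* Im(z^2 + c) */
        n++;
    }
    return n;
}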

        
3 Answers

    轻奢々 answered on 2020-11-30 14:40

    To speed up the algorithm, I use a simplified iteration built from 2 FMAs, 1 multiplication and 2 additions. I process 8 iterations that way, then compute the escape radius and roll back the last 8 iterations if necessary (one possible rollback is sketched after the escape test below).

    The following critical loop, X = X^2 + C, written with x86 intrinsics, is nicely unrolled by the compiler, and after unrolling you will spot that the two FMA operations are not badly dependent on each other.

    //  IACA_START;
    for (j = 0; j < 8; j++) {
        Xrm = _mm256_mul_ps(Xre, Xim);
        Xtt = _mm256_fmsub_ps(Xim, Xim, Cre);
        Xrm = _mm256_add_ps(Xrm, Xrm);
        Xim = _mm256_add_ps(Cim, Xrm);
        Xre = _mm256_fmsub_ps(Xre, Xre, Xtt);
    }       // for
    //  IACA_END;
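
    Spelling out the algebra behind those five operations (this is just the loop body above written as formulas, with X = Xre + i·Xim and C = Cre + i·Cim), which accounts for the 2 FMAs, 1 multiplication and 2 additions:

    $$
    \begin{aligned}
    X_{tt}  &= X_{im}^2 - C_{re} && \text{(fmsub)}\\
    X_{re}' &= X_{re}^2 - X_{tt} = X_{re}^2 - X_{im}^2 + C_{re} && \text{(fmsub)}\\
    X_{im}' &= (X_{re}X_{im} + X_{re}X_{im}) + C_{im} = 2\,X_{re}X_{im} + C_{im} && \text{(mul, add, add)}
    \end{aligned}
    $$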
    

    And then I compute the escape test (|X|² against the threshold), which costs another FMA and another multiplication, but only once every 8 iterations.

    cmp = _mm256_mul_ps(Xre, Xre);                        // Xre^2
    cmp = _mm256_fmadd_ps(Xim, Xim, cmp);                 // Xre^2 + Xim^2
    cmp = _mm256_cmp_ps(cmp, vec_threshold, _CMP_LE_OS);  // still inside the escape radius?
    if (_mm256_testc_si256(_mm256_castps_si256(cmp), vec_one)) {
        i += 8;       // all lanes still inside: keep the 8 iterations and carry on
        continue;
    }
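
    I do not show the rollback here; one way to do it (a sketch, with hypothetical snapshot copies Xre_save / Xim_save taken just before the j-loop) is:

    Xre = Xre_save;                       // restore the state from before the last 8 iterations
    Xim = Xim_save;
    for (j = 0; j < 8; j++) {             // redo them one at a time, testing after every step
        Xrm = _mm256_mul_ps(Xre, Xim);
        Xtt = _mm256_fmsub_ps(Xim, Xim, Cre);
        Xrm = _mm256_add_ps(Xrm, Xrm);
        Xim = _mm256_add_ps(Cim, Xrm);
        Xre = _mm256_fmsub_ps(Xre, Xre, Xtt);
        cmp = _mm256_mul_ps(Xre, Xre);
        cmp = _mm256_fmadd_ps(Xim, Xim, cmp);
        cmp = _mm256_cmp_ps(cmp, vec_threshold, _CMP_LE_OS);
        // update the per-lane iteration counts from cmp here, and stop
        // once no lane remains inside the radius
    }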
    

    You mention that "addition is slow". This is not exactly true, but you are right that multiplication throughput has been getting higher and higher on recent architectures.

    Multiplication latency and dependencies are the key. FMA has a throughput of one per cycle and a latency of 5 cycles, so the execution of independent FMA instructions can overlap: with those numbers, up to five independent FMAs can be in flight at once, and a single dependency chain leaves most of that capacity unused.

    Additions that consume the result of a multiplication get the full latency hit.

    So you have to break these immediate dependencies with "code stitching": compute 2 points in the same loop, simply interleave their code, and then check with IACA what is going on. The following code has 2 sets of variables (suffixed 0 and 1, for X0 = X0^2 + C0 and X1 = X1^2 + C1) and starts to fill the FMA holes:

    for (j = 0; j < 8; j++) {
        Xrm0 = _mm256_mul_ps(Xre0, Xim0);
        Xrm1 = _mm256_mul_ps(Xre1, Xim1);
        Xtt0 = _mm256_fmsub_ps(Xim0, Xim0, Cre);   // note: Cre is not suffixed here (shared by both point sets)
        Xtt1 = _mm256_fmsub_ps(Xim1, Xim1, Cre);
        Xrm0 = _mm256_add_ps(Xrm0, Xrm0);
        Xrm1 = _mm256_add_ps(Xrm1, Xrm1);
        Xim0 = _mm256_add_ps(Cim0, Xrm0);
        Xim1 = _mm256_add_ps(Cim1, Xrm1);
        Xre0 = _mm256_fmsub_ps(Xre0, Xre0, Xtt0);
        Xre1 = _mm256_fmsub_ps(Xre1, Xre1, Xtt1);
    }       // for
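
    The escape test then has to cover both point sets. A sketch of one way to do it (cmp0 / cmp1 are new temporaries, both sets are retired together, and vec_threshold and vec_one are reused from above):

    cmp0 = _mm256_mul_ps(Xre0, Xre0);
    cmp0 = _mm256_fmadd_ps(Xim0, Xim0, cmp0);
    cmp0 = _mm256_cmp_ps(cmp0, vec_threshold, _CMP_LE_OS);
    cmp1 = _mm256_mul_ps(Xre1, Xre1);
    cmp1 = _mm256_fmadd_ps(Xim1, Xim1, cmp1);
    cmp1 = _mm256_cmp_ps(cmp1, vec_threshold, _CMP_LE_OS);
    // advance by 8 only while every lane of both point sets is still inside the radius
    if (_mm256_testc_si256(_mm256_castps_si256(_mm256_and_ps(cmp0, cmp1)), vec_one)) {
        i += 8;
        continue;
    }

    Retiring both sets together keeps the bookkeeping simple; testing cmp0 and cmp1 separately avoids rolling back a set that has not escaped.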
    

    To summarize:

    • you can halve the number of instructions in your critical loop;
    • you can add more independent instructions and take advantage of the high throughput of multiplications and fused multiply-adds, in spite of their latency.
