Significant FMA performance anomaly experienced in the Intel Broadwell processor


北海茫月 2020-12-05 00:38

Code1:

vzeroall
mov             rcx, 1000000
startLabel1:
vfmadd231ps     ymm0, ymm0, ymm0
vfmadd231ps     ymm1, ymm1, ymm1
vfmadd231ps     ymm2, ym


      
      
        
一生所求 (original poster) 2020-12-05 00:51

Update: the previous version contained 6 VPADDD instructions (vs. 5 in the question), and the extra VPADDD caused an imbalance on Broadwell. After this was fixed, Haswell, Broadwell and Skylake issue almost the same number of uops to ports 0, 1 and 5.

There is no port contamination, but uops are scheduled suboptimally: on Broadwell the majority of uops go to Port 5, which makes it the bottleneck before Ports 0 and 1 are saturated.
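To make the bottleneck argument concrete, here is a rough sketch of the best-case port balance for one loop iteration. It rests on assumptions that do not come from this answer: that FMA uops can only issue on ports 0 and 1, that VPADDD uops can only issue on ports 1 and 5 (as commonly listed for Haswell and Broadwell), that each port accepts one uop per cycle, and that the loop counter and branch do not compete for these three ports. The function name `best_port_balance` is made up for illustration.

```python
from itertools import product

# Assumed port bindings (not taken from the answer itself):
#   VFMADD231PS -> port 0 or port 1
#   VPADDD      -> port 1 or port 5
N_FMA = 10   # FMA instructions per loop iteration
N_PADD = 5   # VPADDD instructions per loop iteration


def best_port_balance(n_fma=N_FMA, n_padd=N_PADD):
    """Brute-force the uop split that minimises the busiest port's load.

    Returns (cycles_per_iteration, {port: uops}) for the best split found.
    """
    best = None
    # fma_p0: FMAs sent to port 0 (the rest go to port 1)
    # padd_p1: VPADDDs sent to port 1 (the rest go to port 5)
    for fma_p0, padd_p1 in product(range(n_fma + 1), range(n_padd + 1)):
        load = {
            0: fma_p0,
            1: (n_fma - fma_p0) + padd_p1,
            5: n_padd - padd_p1,
        }
        cycles = max(load.values())  # one uop per port per cycle (assumed)
        if best is None or cycles < best[0]:
            best = (cycles, load)
    return best


cycles, load = best_port_balance()
print(cycles, load)
```

Under these assumptions the ideal schedule puts five uops on each of the three ports, so any scheduler that sends more than five uops per iteration to a single port falls short of that bound. Calling `best_port_balance(10, 6)` models the earlier version with six VPADDD instructions, where a perfectly even split is no longer possible.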

To demonstrate what is going on, I suggest (ab)using the demo on PeachPy.IO:


Open www.peachpy.io in Google Chrome (it does not work in other browsers).
Replace the default code (which implements the SDOT function) with the code below, which is literally your example ported to PeachPy syntax:

n = Argument(size_t)
x = Argument(ptr(const_float_))
incx = Argument(size_t)
y = Argument(ptr(const_float_))
incy = Argument(size_t)

with Function("sdot", (n, x, incx, y, incy)) as function:
    reg_n = GeneralPurposeRegister64()
    LOAD.ARGUMENT(reg_n, n)

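    # Zero all YMM registers before entering the loop.
    VZEROALL()

    with Loop() as loop:
        # Per iteration: 10 FMAs on ymm0-ymm9, then 5 VPADDDs on ymm10-ymm14.
        for i in range(15):
            ymm_i = YMMRegister(i)
            if i < 10:
                VFMADD231PS(ymm_i, ymm_i, ymm_i)
            else:
                VPADDD(ymm_i, ymm_i, ymm_i)
        # Count down reg_n and repeat until it reaches zero.
        DEC(reg_n)
        JNZ(loop.begin)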

    RETURN()

I have a number of machines with different microarchitectures as a backend for PeachPy.io. Choose Intel Haswell, Intel Broadwell, or Intel Skylake and press "Quick Run". The system will compile your code, upload it to the server, and visualize the performance counters collected during execution.
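Before running it, it can help to confirm what the loop body expands to. The sketch below is plain Python with no PeachPy dependency; it only mirrors the `for i in range(15)` logic above and prints the corresponding instruction text. The helper name `loop_body` is made up for illustration.

```python
def loop_body(n_regs=15, n_fma=10):
    """List the instructions one loop iteration contains, as assembly text.

    Mirrors the PeachPy loop: registers below n_fma get an FMA,
    the remaining registers get a VPADDD.
    """
    lines = []
    for i in range(n_regs):
        op = "vfmadd231ps" if i < n_fma else "vpaddd"
        lines.append(f"{op} ymm{i}, ymm{i}, ymm{i}")
    return lines


body = loop_body()
for line in body:
    print(line)

# Count each kind of instruction to check it matches the question.
n_fma = sum(line.startswith("vfmadd231ps") for line in body)
n_padd = sum(line.startswith("vpaddd") for line in body)
print(n_fma, "FMA,", n_padd, "VPADDD")
```

This makes it easy to check that the ported code has 10 FMA and 5 VPADDD instructions per iteration, which is the count the update at the top refers to.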
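Here is the uop distribution over execution ports on Intel Haswell:

[Plot: uop distribution over execution ports, Intel Haswell (image not preserved)]

And here is the same plot from Intel Broadwell:

[Plot: uop distribution over execution ports, Intel Broadwell (image not preserved)]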

Apparently, whatever the flaw in the uop scheduler was, it was fixed in Intel Skylake, because port pressure on that machine is the same as on Haswell.
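If you want to compare machines numerically rather than by eye, a small helper can turn per-port uop counts into shares and name the busiest port. This is only a sketch: the function names are made up, and the sample counts are hypothetical placeholders, not measurements from any of the machines above.

```python
def port_shares(uops_by_port):
    """Convert raw per-port uop counts into fractions of the total."""
    total = sum(uops_by_port.values())
    if total == 0:
        raise ValueError("no uops recorded")
    return {port: count / total for port, count in uops_by_port.items()}


def busiest_port(uops_by_port):
    """Return the port that received the most uops."""
    return max(uops_by_port, key=uops_by_port.get)


# Hypothetical counts for illustration only; substitute the values
# read from the performance counter plot for each machine.
sample = {0: 30, 1: 30, 5: 40}

shares = port_shares(sample)
for port in sorted(shares):
    print(f"port {port}: {shares[port]:.1%}")
print("busiest port:", busiest_port(sample))
```

A balanced schedule would show roughly equal shares on ports 0, 1 and 5; a noticeably larger share on one port points to that port as the likely bottleneck.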

    
             
                                                        
            
            
              
                