Packing two DWORDs into a QWORD to save store bandwidth

Asked by 梦毁少年i on 2020-12-11 22:21

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:

top:
    mov eax, DWORD [rsi]       ; load from one location
    mov DWORD [rdi], eax       ; first DWORD store
    mov eax, DWORD [rdx]       ; load from another, non-contiguous location
    mov DWORD [rdi+4], eax     ; second DWORD store, contiguous with the first
    ; ... pointer updates and loop branch back to top
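The goal, combining two DWORD loads into a single QWORD store, can be modeled in plain C (function and variable names are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit values and write them with one 64-bit store.
   On little-endian x86, 'a' ends up at dst[0..3] and 'b' at dst[4..7],
   the same layout as two contiguous DWORD stores. */
static void pack_store(uint8_t *dst, uint32_t a, uint32_t b) {
    uint64_t q = (uint64_t)a | ((uint64_t)b << 32);
    memcpy(dst, &q, 8);   /* one QWORD store instead of two DWORD stores */
}
```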
1 Answer
  • Answered 2020-12-11 22:53

    > It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the loaded DWORD.

    If wider loads are OK for correctness and performance (cache-line splits...), we can use shld:

    top:
        mov eax, DWORD [rsi]
        mov rbx, QWORD [rdx-4]     ; unaligned(?) 64-bit load
    
        shld rax, rbx, 32          ; 1 uop on Intel SnB-family, 0.5c recip throughput
        mov QWORD [rdi], rax
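To see why the offset load works, here is a plain-C model (names are illustrative): shld rax, rbx, 32 computes (rax << 32) | (rbx >> 32), and on little-endian x86 the high 32 bits of the QWORD loaded at [rdx-4] are exactly the DWORD at [rdx]:

```c
#include <stdint.h>

/* Model of:  mov eax, [rsi]  /  mov rbx, [rdx-4]  /  shld rax, rbx, 32
   shld dst, src, 32 computes (dst << 32) | (src >> 32), so the garbage
   low half of the offset QWORD load is shifted out for free. */
static uint64_t pack_via_shld(uint32_t dword_at_rsi, uint64_t qword_at_rdx_minus_4) {
    uint64_t rax = dword_at_rsi;                 /* zero-extending DWORD load */
    return (rax << 32) | (qword_at_rdx_minus_4 >> 32);
}
```

Note the resulting order: the DWORD from [rdx] lands in the low half of the stored QWORD and the one from [rsi] in the high half.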
    

    MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).

    top:
        movd       mm0, DWORD [rsi]
        punpckldq  mm0, QWORD [rdx]     ; 1 micro-fused uop on Intel SnB-family
    
        movq       QWORD [rdi], mm0
    
     ; required after the loop, making it only worth-while for long-running loops
     emms
    

    punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.

    You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunpckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/

    Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.


    To avoid cache-line splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason to use MMX.

    top:
        movd       xmm0, DWORD [rsi]
    
        movd       xmm1, DWORD [rdx]           ; SSE2
        punpckldq  xmm0, xmm1
        ; or pinsrd  xmm0, DWORD [rdx], 1      ; 2 uops not micro-fused
    
        movq       QWORD [rdi], xmm0
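The same sequence can be written with SSE2 intrinsics; a sketch (function and pointer names are illustrative), which compilers typically lower to the movd / punpckldq / movq sequence above:

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */

/* Two separate DWORD loads, interleave the low dwords, one QWORD store.
   Avoids the cache-line-split risk of a wide memory-source punpckldq. */
static void pack2(uint32_t *dst, const uint32_t *srca, const uint32_t *srcb) {
    __m128i a = _mm_cvtsi32_si128((int)*srca);    /* movd xmm0, [rsi] */
    __m128i b = _mm_cvtsi32_si128((int)*srcb);    /* movd xmm1, [rdx] */
    __m128i packed = _mm_unpacklo_epi32(a, b);    /* punpckldq xmm0, xmm1 */
    _mm_storel_epi64((__m128i *)dst, packed);     /* movq [rdi], xmm0 */
}
```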
    

    Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.
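For reference, vpgatherdd performs eight independent DWORD loads from base + index*4 addresses in one instruction; a scalar C model of the semantics (names are illustrative):

```c
#include <stdint.h>

/* Scalar model of vpgatherdd ymm, [base + ymm_idx*4]:
   eight independent DWORD loads gathered into one 32-byte result,
   which can then be written back with a single contiguous store. */
static void gather8_dwords(uint32_t *dst, const uint32_t *base, const int32_t *idx) {
    for (int i = 0; i < 8; i++)
        dst[i] = base[idx[i]];
}
```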
