Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

Asked by 滥情空心 on 2020-12-18 13:42 · 3 answers

In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

[the question's code block was not preserved in this mirror]
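The question's code block did not survive the page conversion. Based on the description, it presumably did four scalar broadcast-loads; a plausible sketch (the array name `b`, the `out` spill array, and the function name are illustrative assumptions, not from the question):

```cpp
#include <immintrin.h>
#include <cassert>

// Illustrative sketch: broadcast 4 contiguous doubles into 4 YMM vectors.
// Each _mm256_broadcast_sd compiles to vbroadcastsd ymm, m64 with AVX.
// The target("avx") attribute lets this compile without -mavx on GCC/Clang;
// the CPU running it must still support AVX. The stores are only there so
// the results can be inspected.
__attribute__((target("avx")))
void broadcast4(const double *b, double out[4][4]) {
    for (int i = 0; i < 4; ++i) {
        __m256d v = _mm256_broadcast_sd(&b[i]);  // all 4 lanes = b[i]
        _mm256_storeu_pd(out[i], v);             // spill for inspection
    }
}
```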
3 Answers
  •  眼角桃花
     2020-12-18 13:57

    TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). That's very good on Haswell and later, where vbroadcastsd ymm, [mem] doesn't require an ALU shuffle uop, and it's usually also the best option on Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).

    It also means you don't need to care about alignment at all, beyond natural alignment for a double.

    The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.


    (It's usually best to use set1(x), not _mm256_broadcast_sd(&x). If the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose between a store -> broadcast-load or two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
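    To make the contrast concrete, here is a minimal sketch (helper names are made up): set1 leaves the strategy to the compiler, while broadcast_sd pins the source to memory.

```cpp
#include <immintrin.h>
#include <cassert>

// target("avx") lets these compile without -mavx on GCC/Clang; the host
// CPU must still support AVX. Stores are only for inspecting the result.

// set1: compiler picks vbroadcastsd from memory, or shuffles if x is
// already live in a register after inlining.
__attribute__((target("avx")))
void splat_set1(double x, double out[4]) {
    _mm256_storeu_pd(out, _mm256_set1_pd(x));
}

// broadcast_sd: always expressed as a broadcast from a memory address.
__attribute__((target("avx")))
void splat_mem(const double *x, double out[4]) {
    _mm256_storeu_pd(out, _mm256_broadcast_sd(x));
}
```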

    If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128/_mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).
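    In intrinsics, that trade looks roughly like this sketch (function and parameter names are assumptions): one 16B->32B broadcast-load feeding two in-lane unpacks produces two broadcast vectors with only one load uop.

```cpp
#include <immintrin.h>
#include <cassert>

// Trade two 64->256b broadcast-loads for one vbroadcastf128 plus two
// in-lane shuffles (vunpcklpd / vunpckhpd). target("avx") lets this
// compile without -mavx on GCC/Clang; the CPU must support AVX.
__attribute__((target("avx")))
void broadcast_pair(const double *p, double lo[4], double hi[4]) {
    __m256d both = _mm256_broadcast_pd((const __m128d *)p); // p0 p1 p0 p1
    __m256d b0 = _mm256_unpacklo_pd(both, both);            // p0 in all 4 lanes
    __m256d b1 = _mm256_unpackhi_pd(both, both);            // p1 in all 4 lanes
    _mm256_storeu_pd(lo, b0);   // stores only for inspection
    _mm256_storeu_pd(hi, b1);
}
```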

    Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector.
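    A sketch of that AVX2 variant (function name is an assumption): one 32B load, then four lane-crossing shuffles, each selecting one element into all four lanes.

```cpp
#include <immintrin.h>
#include <cassert>

// AVX2: one 32B load plus 4x vpermpd. Each immediate repeats one 2-bit
// element index four times (e.g. 0x55 = 0b01010101 selects element 1).
// target("avx2") lets this compile without -mavx2 on GCC/Clang; the host
// CPU must support AVX2. Stores are only for inspecting the results.
__attribute__((target("avx2")))
void broadcast4_avx2(const double *p, double out[4][4]) {
    __m256d v = _mm256_loadu_pd(p);                           // p0 p1 p2 p3
    _mm256_storeu_pd(out[0], _mm256_permute4x64_pd(v, 0x00)); // p0 x4
    _mm256_storeu_pd(out[1], _mm256_permute4x64_pd(v, 0x55)); // p1 x4
    _mm256_storeu_pd(out[2], _mm256_permute4x64_pd(v, 0xAA)); // p2 x4
    _mm256_storeu_pd(out[3], _mm256_permute4x64_pd(v, 0xFF)); // p3 x4
}
```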


    Source: Agner Fog's instruction tables (and microarch PDF):

    Intel Haswell and later:

    vbroadcastsd ymm,[mem] and other broadcast-load insns are 1uop instructions that are handled entirely by a load port (the broadcast happens "for free").

    The total cost of doing four broadcast-loads this way is 4 instructions. Fused-domain: 4 uops. Unfused-domain: 4 uops for p2/p3. Throughput: two vectors per cycle.

    Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.

    Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm,m64 and shuffles:

    ## Haswell maximum broadcast throughput with AVX1
    vbroadcastsd    ymm0, [rsi]
    vbroadcastsd    ymm1, [rsi+8]
    vbroadcastf128  ymm2, [rsi+16]     # p23 only on Haswell, also p5 on SnB/IvB
    vunpckhpd       ymm3, ymm2,ymm2
    vunpcklpd       ymm2, ymm2,ymm2
    vbroadcastsd    ymm4, [rsi+32]     # or vaddpd ymm0, [rdx+something]
    #add             rsi, 40
    

    Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.

    AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.

    ## Haswell maximum broadcast throughput with AVX2
    vmovups         ymm3, [rsi]
    vbroadcastsd    ymm0, xmm3         # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
    vpermpd         ymm1, ymm3, 0b01_01_01_01   # NASM syntax for 0x55
    vpermpd         ymm2, ymm3, 0b10_10_10_10
    vpermpd         ymm3, ymm3, 0b11_11_11_11
    vbroadcastsd    ymm4, [rsi+32]
    vbroadcastsd    ymm5, [rsi+40]
    vbroadcastsd    ymm6, [rsi+48]
    vbroadcastsd    ymm7, [rsi+56]
    vbroadcastsd    ymm8, [rsi+64]
    vbroadcastsd    ymm9, [rsi+72]
    vbroadcastsd    ymm10,[rsi+80]    # or vaddpd ymm0, [rdx + whatever]
    #add             rsi, 88
    

    AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 2 uops per 4 cycles beyond this.

    I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.


    Intel SnB/IvB:

    vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).

    vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop load-port-only. Interestingly, vmovddup ymm, m256 is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits for unaligned 32B-loads, though. gcc defaults to using movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.

    4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (also needed for store-address generation). That only happens on the same bank of two different cache lines in the same cycle.

    Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.

    SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm,ymm,xmm,i takes a p5 shuffle uop.

    ## Sandybridge maximum broadcast throughput: AVX1
    vbroadcastsd    ymm0, [rsi]
    add             rsi, 8
    

    One per clock, saturating p5 but using only half the capacity of p23.

    We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:

    vbroadcastf128  ymm2, [rsi+16]     # 2 uops: p23 + p5 on SnB/IvB
    vunpckhpd       ymm3, ymm2,ymm2    # 1 uop: p5
    vunpcklpd       ymm2, ymm2,ymm2    # 1 uop: p5
    

    Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23.
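    A sketch of that one-load variant (function name is an assumption): 2x vperm2f128 duplicates each 128b half across both lanes, then 4 in-lane unpacks finish the broadcasts, using a single load uop and 6 shuffle uops.

```cpp
#include <immintrin.h>
#include <cassert>

// One 32B load, 2x vperm2f128, then 4x vunpckl/hpd. Minimizes p23
// pressure at the cost of extra p5 shuffles. target("avx") lets this
// compile without -mavx on GCC/Clang; the CPU must support AVX.
// Stores are only for inspecting the results.
__attribute__((target("avx")))
void broadcast4_oneload(const double *p, double out[4][4]) {
    __m256d v  = _mm256_loadu_pd(p);                  // p0 p1 p2 p3
    __m256d lo = _mm256_permute2f128_pd(v, v, 0x00);  // p0 p1 p0 p1
    __m256d hi = _mm256_permute2f128_pd(v, v, 0x11);  // p2 p3 p2 p3
    _mm256_storeu_pd(out[0], _mm256_unpacklo_pd(lo, lo)); // p0 x4
    _mm256_storeu_pd(out[1], _mm256_unpackhi_pd(lo, lo)); // p1 x4
    _mm256_storeu_pd(out[2], _mm256_unpacklo_pd(hi, hi)); // p2 x4
    _mm256_storeu_pd(out[3], _mm256_unpackhi_pd(hi, hi)); // p3 x4
}
```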
