In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I need to load 4 contiguous doubles into 4 YMM registers, broadcasting each double across one register.
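The straightforward way is four scalar broadcasts, one per double; a minimal sketch of that starting point (variable names are assumed, not from the original code):
__m256d b0000 = _mm256_broadcast_sd(&b[4*k+0]);   // { b0, b0, b0, b0 }
__m256d b1111 = _mm256_broadcast_sd(&b[4*k+1]);   // { b1, b1, b1, b1 }
__m256d b2222 = _mm256_broadcast_sd(&b[4*k+2]);   // { b2, b2, b2, b2 }
__m256d b3333 = _mm256_broadcast_sd(&b[4*k+3]);   // { b3, b3, b3, b3 }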
Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.
__m256d b01   = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0])); // low lane = { b0, b1 }, upper lane undefined
__m256d b23   = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2])); // low lane = { b2, b3 }, upper lane undefined
__m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);            // { b0, b1, b0, b1 }
__m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);            // { b2, b3, b2, b3 }
__m256d b0000 = _mm256_permute_pd(b0101, 0);                    // { b0, b0, b0, b0 }
__m256d b1111 = _mm256_permute_pd(b0101, 0xf);                  // { b1, b1, b1, b1 }
__m256d b2222 = _mm256_permute_pd(b2323, 0);                    // { b2, b2, b2, b2 }
__m256d b3333 = _mm256_permute_pd(b2323, 0xf);                  // { b3, b3, b3, b3 }
In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.
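For comparison, the single 256-bit-load version (roughly what the original answer this builds on does; shown here only as a sketch) is:
__m256d b0123 = _mm256_load_pd(&b[4*k]);                    // { b0, b1, b2, b3 }
__m256d b0101 = _mm256_permute2f128_pd(b0123, b0123, 0x00); // duplicate low lane: { b0, b1, b0, b1 }
__m256d b2323 = _mm256_permute2f128_pd(b0123, b0123, 0x11); // duplicate high lane: { b2, b3, b2, b3 }
followed by the same four _mm256_permute_pd() calls as above.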
Edit: gcc compiles the two 128-bit loads and the first two permutes into
vmovapd (%rdi),%xmm8
vmovapd 0x10(%rdi),%xmm4
vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
vperm2f128 $0x0,%ymm4,%ymm4,%ymm2
Paul R's suggestion of using _mm256_broadcast_pd() can be written as:
__m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);   // { b0, b1, b0, b1 }
__m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);   // { b2, b3, b2, b3 }
which compiles into
vbroadcastf128 (%rdi),%ymm6
vbroadcastf128 0x10(%rdi),%ymm11
and is faster than the two vmovapd + vperm2f128 pairs shown above (tested).
In my code, which is bound by the vector execution ports rather than by L1 cache accesses, this is still slightly slower than 4 _mm256_broadcast_sd() calls, but I imagine that code constrained by L1 bandwidth can benefit greatly from this.
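Putting the pieces together, the broadcast-based variant is just the two 128-bit broadcasts followed by the same in-lane permutes as in the first snippet. A small self-contained helper along these lines (the function name and signature are mine, not from the original code) illustrates it:
#include <immintrin.h>

// Broadcast 4 contiguous doubles b[0], b[1], b[2], b[3] into four YMM
// registers using two vbroadcastf128 loads plus in-lane permutes.
static inline void broadcast4_pd(const double* b,
                                 __m256d& b0000, __m256d& b1111,
                                 __m256d& b2222, __m256d& b3333)
{
    __m256d b0101 = _mm256_broadcast_pd((const __m128d*)&b[0]); // { b0, b1, b0, b1 }
    __m256d b2323 = _mm256_broadcast_pd((const __m128d*)&b[2]); // { b2, b3, b2, b3 }
    b0000 = _mm256_permute_pd(b0101, 0);    // { b0, b0, b0, b0 }
    b1111 = _mm256_permute_pd(b0101, 0xf);  // { b1, b1, b1, b1 }
    b2222 = _mm256_permute_pd(b2323, 0);    // { b2, b2, b2, b2 }
    b3333 = _mm256_permute_pd(b2323, 0xf);  // { b3, b3, b3, b3 }
}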