> The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 linux x86_64 system, the types uint_fast16_t and uint_…
I think such a design decision is not simple to make. It depends on many factors, and for the moment I don't take your experiment as conclusive; see below.
First of all, there is no such thing as a single concept of what "fast" should mean. Here you emphasized in-place multiplication, which is just one particular point of view.
Then, x86_64 is an architecture, not a processor, so the outcome might be quite different for different processors in that family. I don't think it would be sane for gcc to make the type decision depend on particular command-line switches that optimize for a given processor.
Now, to come back to your example: I assume you have also looked at the assembler code? Did it, for example, use SSE instructions to realize your code? Did you switch on processor-specific options, something like -march=native?
Edit: I experimented a bit with your test program. If I leave it exactly as it is, I can basically reproduce your measurements. But after modifying it and playing around with it, I am even less convinced that it is conclusive.
For example, if I change the inner loop to also count downward, the assembler looks almost the same as before (but uses a decrement and a test against 0), yet execution takes about 50% longer. So I guess the timing depends very much on the environment of the instruction you want to benchmark: pipeline stalls and the like. You'd have to benchmark code of very different natures, where the instructions are issued in different contexts and alignment problems and vectorization come into play, to decide what the appropriate types for the fast typedefs are.