Effectiveness of GCC optimization on bit operations

暗喜 2021-02-19 14:25

Here are two ways to set an individual bit in C on x86-64:

inline void SetBitC(long *array, int bit) {
   //Pure C version
   *array |= 1<<bit;
}

inline void SetBitASM(long *array, int bit) {
   //Inline x86 assembly version
   asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}

Compiled with optimizations enabled, the C version takes about 90% more time than the asm version when the bit is a compile-time constant. Why does GCC optimize so poorly for such a common operation? Am I doing something wrong with the C version?
5 Answers
  • 2021-02-19 14:44

    Why does GCC optimize so poorly for such a common operation?

    Prelude: Since the late 1980s, the emphasis in compiler optimization has moved away from microbenchmarks, which measure individual operations, and toward macrobenchmarks, which measure applications whose speed people care about. These days most compiler writers are focused on macrobenchmarks, and developing good benchmark suites is something that is taken seriously.

    Answer: Nobody on the gcc team is using a benchmark where the difference between `or` and `bts` matters to the execution time of a real program. If you can produce such a program, you might be able to get the attention of people in gcc-land.

    Am I doing something wrong with the C version?

    No, this is perfectly good standard C. Very readable and idiomatic, in fact.

  • 2021-02-19 14:45

    Can you post the code that you are using to do the timing? This sort of operation can be tricky to time accurately.

    In theory the two code sequences should be equally fast, so the most likely explanation (to my mind) is that something is causing your timing code to give bogus results.
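
    As an illustration (a sketch, not code from the question; the names and iteration count are made up), one fairly robust approach is to read a monotonic clock around a loop that is kept alive with volatile:

    #include <stdio.h>
    #include <time.h>

    /* Seconds elapsed between two timespecs. */
    static double elapsed_seconds(struct timespec a, struct timespec b) {
        return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        volatile unsigned long long i = 0;  /* volatile keeps the loop from being optimized out */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long long n = 0; n < 100000000LL; n++)
            i |= 1ULL << (n & 63);          /* the operation under test */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.3f s\n", elapsed_seconds(t0, t1));  /* link with -lrt on older glibc */
        return 0;
    }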

  • 2021-02-19 14:45

    I think you're asking a lot of your optimizer.

    You might be able to help it out a little by doing a `register long z = 1L << bit;`, then or-ing that with your array (sketched below).

    However, I assume that by 90% more time you mean that the C version takes 10 cycles and the asm version takes 5 cycles, right? How does the performance compare at -O2 or -O1?
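
    For concreteness, here is a sketch of that suggestion (whether it actually changes the generated code is compiler-dependent, and `register` is only a hint that modern compilers mostly ignore):

    inline void SetBitC(long *array, int bit) {
        register long z = 1L << bit;  /* hoist the mask into a local */
        *array |= z;                  /* then OR it into the array word */
    }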

  • 2021-02-19 14:54

    This is a very, very common operation on embedded systems, which are generally resource constrained. 10 cycles vs 5 cycles is a nasty performance penalty on such systems. There are many cases when one wants to access IO ports or use 16 or 32 bit registers as Boolean bit flags to save memory.

    The fact is that `if (bit_flags & (1 << 12))` is far more readable [and portable when implemented with a library] than the assembly equivalent; likewise for `IO_PINS |= 1 << 5;`. These are unfortunately many times slower, so the awkward asm macros live on (see the sketch after this paragraph).
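
    To make the library idea concrete, such helpers might look like the sketch below (the macro names are illustrative, not from any particular vendor header):

    /* Portable bit-flag helpers: plain C that any compiler can understand. */
    #define BIT_SET(reg, n)   ((reg) |=  (1UL << (n)))
    #define BIT_CLEAR(reg, n) ((reg) &= ~(1UL << (n)))
    #define BIT_TEST(reg, n)  (((reg) >> (n)) & 1UL)

    /* Usage with the (hypothetical) names from above:
       BIT_SET(IO_PINS, 5);  if (BIT_TEST(bit_flags, 12)) { ... } */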

    In many ways, the goals of embedded and userspace applications are opposite. The responsiveness of external communications (to a user interface or machine interface) is of minor importance, while ensuring that a control loop (equivalent to a micro-benchmark) completes in minimal time is absolutely critical and can make or break a chosen processor or control strategy.

    Obviously, if one can afford a multi-GHz CPU and all the associated peripherals, chipsets, etc. needed to support it, one does not really need to worry about low-level optimisation at all. A 1000x slower microcontroller in a real-time control system means that saving clock cycles is 1000x more important.

  • 2021-02-19 15:01

    For such code:

    #include <stdio.h>
    #include <time.h>
    
    int main() {
      volatile long long i = 0;
      time_t start = time (NULL);
      for (long long n = 0; n < (1LL << 32); n++) {
        i |= 1 << 10;
      }
      time_t end = time (NULL);
      printf("C took %ds\n", (int)(end - start));
      start = time (NULL);
      for (long long n = 0; n < (1LL << 32); n++) {
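        /* bts: bit test and set; "[i]" ties the input to the output register,
           and the bit index 10 is passed as an immediate ("i" constraint) */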
        __asm__ ("bts %[bit], %[i]"
                      : [i] "=r"(i)
                      : "[i]"(i), [bit] "i" (10));
      }
      end = time (NULL);
      printf("ASM took %ds\n", (int)(end - start));
    }
    

    the result was:

    C took 12s
    ASM took 10s
    

    My flags were -std=gnu99 -O2 -march=core2; without the volatile, the loop was optimized out. Compiler: gcc 4.4.2.
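
    For reference, a matching compile-and-run invocation might look like this (the source file name is an assumption):

    gcc -std=gnu99 -O2 -march=core2 bts_bench.c -o bts_bench && ./bts_bench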

    There was no difference with a version that operates on i directly in memory ("+m") and takes the bit index in a register:

    __asm__ ("bts %[bit], %[i]"
                  : [i] "+m"(i)
                  : [bit] "r" (10));
    

    So probably the answer is: nobody cares. In a microbenchmark the difference between those two methods shows up, but in real life I believe such code does not take much CPU time.

    Additionally, for such code:

    #include <stdio.h>
    #include <time.h>
    
    int main() {
      volatile long long i = 0;
      time_t start = time (NULL);
      for (long long n = 0; n < (1LL << 32); n++) {
        i |= 1 << (n % 32);
      }
      time_t end = time (NULL);
      printf("C took %ds\n", (int)(end - start));
      start = time (NULL);
      for (long long n = 0; n < (1LL << 32); n++) {
        __asm__ ("bts %[bit], %[i]"
                      : [i] "+m"(i)
                      : [bit] "r" (n % 32));
      }
      end = time (NULL);
      printf("ASM took %ds\n", (int)(end - start));
    }
    

    The result was:

    C took 9s
    ASM took 10s
    

    Both results were stable across runs. Test CPU: Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz.
