Efficient way to compute geometric mean of many numbers

Asked by 栀梦 on 2020-12-14 07:12 · 7 answers · 1058 views

I need to compute the geometric mean of a large set of numbers, whose values are not a priori limited. The naive way would be

double geometric_mean(std::vector<double> const & data)
{
    double product = 1.0;
    for (double x : data)
        product *= x;
    return std::pow(product, 1.0 / data.size());
}

which overflows or underflows as soon as the running product leaves the representable range of a double.

7 Answers
  • 2020-12-14 07:29

    The "split exponent and mantissa" solution:

    double geometric_mean(std::vector<double> const & data)
    {
        double m = 1.0;
        long long ex = 0;
        double invN = 1.0 / data.size();

        for (double x : data)
        {
            int i;
            double f1 = std::frexp(x, &i);  // x = f1 * 2^i, with f1 in [0.5, 1)
            m *= f1;
            ex += i;
        }

        // Recombine: (2^ex)^(1/N) * m^(1/N)
        return std::pow(std::numeric_limits<double>::radix, ex * invN) * std::pow(m, invN);
    }

    

    If you are concerned that ex might overflow, you can declare it as a double instead of a long long and multiply by invN at every step, but you may lose a lot of precision with this approach.

    EDIT For large inputs, we can split the computation into several buckets:

    double geometric_mean(std::vector<double> const & data)
    {
        long long ex = 0;
        auto do_bucket = [&data, &ex](std::size_t first, std::size_t last) -> double
        {
            double ans = 1.0;
            for (; first != last; ++first)
            {
                int i;
                ans *= std::frexp(data[first], &i);  // accumulate mantissas, collect exponents
                ex += i;
            }
            return ans;
        };

        // A bucket of this size cannot underflow: each mantissa is in [0.5, 1)
        const std::size_t bucket_size = -std::log2(std::numeric_limits<double>::min());
        std::size_t buckets = data.size() / bucket_size;

        double invN = 1.0 / data.size();
        double m = 1.0;

        for (std::size_t i = 0; i < buckets; ++i)
            m *= std::pow(do_bucket(i * bucket_size, (i + 1) * bucket_size), invN);

        m *= std::pow(do_bucket(buckets * bucket_size, data.size()), invN);

        return std::pow(std::numeric_limits<double>::radix, ex * invN) * m;
    }

    
  • 2020-12-14 07:29

    Instead of using logarithms, which are very expensive, you can directly scale the results by powers of two.

    double geometric_mean(std::vector<double> const & data) {
      double huge = scalbn(1, 512);   // 2^512
      double tiny = scalbn(1, -512);  // 2^-512
      int scale = 0;                  // number of 2^512 factors scaled out
      double product = 1.0;
      for (auto x : data) {
        if (x >= huge) {
          x = scalbn(x, -512);
          scale++;
        } else if (x <= tiny) {
          x = scalbn(x, 512);
          scale--;
        }
        product *= x;
        if (product >= huge) {
          product = scalbn(product, -512);
          scale++;
        } else if (product <= tiny) {
          product = scalbn(product, 512);
          scale--;
        }
      }
      // true product = product * 2^(512*scale)
      return exp2((512.0 * scale + log2(product)) / data.size());
    }

    
  • 2020-12-14 07:43

    You may be able to accelerate this by multiplying numbers as in your original solution and only converting to logarithms every certain number of multiplications (depending on the size of your initial numbers).
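
    A branch-free sketch of that idea, converting to logarithms once per fixed-size block rather than testing the running product (the function name and the block size of 64 are illustrative choices, and assume no run of 64 consecutive values overflows or underflows on its own):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double geometric_mean_blocked(std::vector<double> const & data)
{
    const std::size_t block = 64;  // assumption: 64 in-range doubles cannot
                                   // overflow/underflow the running product
    double sum_log = 0.0;
    double product = 1.0;
    std::size_t n = 0;
    for (double x : data) {
        product *= x;
        if (++n == block) {        // one log call per 64 multiplications
            sum_log += std::log(product);
            product = 1.0;
            n = 0;
        }
    }
    return std::exp((sum_log + std::log(product)) / data.size());
}
```

    This trades the data-dependent branch of the product-testing approach for a predictable counter, at the cost of a weaker overflow guarantee.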

  • 2020-12-14 07:43

    Summing logs to compute products stably is perfectly fine, and rather efficient. If that is not enough, there are ways to get vectorized logarithms with a few SSE operations, and Intel MKL also provides vector math operations.

    To avoid overflow, a common technique is to divide every number by the maximum or minimum magnitude entry beforehand (or sum log differences to the log max or log min). You can also use buckets if the numbers vary a lot (e.g. sum the logs of small and large numbers separately). Note that typically neither of these is needed except for very large sets, since the log of a double is never huge (roughly between -700 and 700).

    Also, you need to keep track of the signs separately.

    Computing log x typically keeps the same number of significant digits as x, except when x is close to 1: use std::log1p if you need to compute prod(1 + x_n) with small x_n.

    Finally, if you have roundoff error problems when summing, you can use Kahan summation or variants.
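
    A sketch combining the points above: sum log|x| with Kahan compensation and track the sign separately (the function name is illustrative; the zero-handling is an assumption, and an odd count of negative factors has no real geometric mean, so the signed result is only meaningful for an even count):

```cpp
#include <cmath>
#include <vector>

double geometric_mean_signed(std::vector<double> const & data)
{
    int sign = 1;
    double sum = 0.0;   // Kahan-compensated sum of log|x|
    double comp = 0.0;  // running compensation (lost low-order bits)
    for (double x : data) {
        if (x == 0.0) return 0.0;   // any zero factor makes the mean zero
        if (x < 0.0) sign = -sign;  // track the sign separately
        double y = std::log(std::fabs(x)) - comp;
        double t = sum + y;
        comp = (t - sum) - y;       // algebraically zero; captures rounding error
        sum = t;
    }
    return sign * std::exp(sum / data.size());
}
```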

  • 2020-12-14 07:45

    I think I figured out a way to do it; it combines the two routines from the question, similar to Peter's idea. Here is an example:

    double geometric_mean(std::vector<double> const & data)
    {
        const double too_large = 1.e64;
        const double too_small = 1.e-64;
        double sum_log = 0.0;
        double product = 1.0;
        for (auto x : data) {
            product *= x;
            if (product > too_large || product < too_small) {
                sum_log += std::log(product);  // flush the partial product to log space
                product = 1;
            }
        }
        return std::exp((sum_log + std::log(product)) / data.size());
    }
    

    The bad news is: this comes with a branch. The good news: the branch predictor is likely to get this almost always right (the branch should only rarely be triggered).

    The branch could be avoided using Peter's idea of a constant number of terms in the product. The problem with that is that overflow/underflow may still occur within only a few terms, depending on the values.

  • 2020-12-14 07:48

    There is a simple idea that reduces computation and also helps prevent overflow: group the numbers, say at least two at a time, take the log of each group, and sum the results. If K is the geometric mean of a, b, c, d, e:

    log(abcde) = 5*log(K)
    
    log(ab) + log(cde) = 5*log(K)
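
    A minimal sketch of the grouping idea (function name illustrative), taking one log call per pair of factors instead of one per factor:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double geometric_mean_pairs(std::vector<double> const & data)
{
    double sum_log = 0.0;
    std::size_t i = 0;
    for (; i + 1 < data.size(); i += 2)
        sum_log += std::log(data[i] * data[i + 1]);  // one log call per pair
    if (i < data.size())                             // odd element left over
        sum_log += std::log(data[i]);
    return std::exp(sum_log / data.size());
}
```

    This halves the number of log calls; the pairwise products must themselves stay in range, so the groups cannot be made arbitrarily large.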
    