Efficient way to compute geometric mean of many numbers

Asked by 栀梦 on 2020-12-14 07:12 · 7 answers · 1058 views

I need to compute the geometric mean of a large set of numbers, whose values are not a priori limited. The naive way would be

double geometric_mean(std::vector<double> const & data)
{
    double product = 1.0;
    for (double x : data)
        product *= x;
    return std::pow(product, 1.0 / data.size());
}

which overflows or underflows as soon as the running product leaves the representable range of a double.

7 Answers
  • 2020-12-14 07:29

    The "split exponent and mantissa" solution:

    double geometric_mean(std::vector<double> const & data)
    {
        double m = 1.0;
        long long ex = 0;
        double invN = 1.0 / data.size();

        for (double x : data)
        {
            int i;
            double f1 = std::frexp(x, &i);  // x = f1 * 2^i, with f1 in [0.5, 1)
            m *= f1;
            ex += i;
        }

        // Recombine: (2^ex)^(1/N) * m^(1/N)
        return std::pow(std::numeric_limits<double>::radix, ex * invN) * std::pow(m, invN);
    }

    

    If you are concerned that ex might overflow, you can declare it as a double instead of a long long and multiply by invN at every step, but you may lose a lot of precision with this approach.

    EDIT For large inputs, we can split the computation into several buckets:

    double geometric_mean(std::vector<double> const & data)
    {
        long long ex = 0;
        auto do_bucket = [&data, &ex](std::size_t first, std::size_t last) -> double
        {
            double ans = 1.0;
            for (; first != last; ++first)
            {
                int i;
                ans *= std::frexp(data[first], &i);  // accumulate mantissas, collect exponents
                ex += i;
            }
            return ans;
        };

        // A bucket of this size cannot underflow: each mantissa is in [0.5, 1)
        const std::size_t bucket_size = -std::log2(std::numeric_limits<double>::min());
        std::size_t buckets = data.size() / bucket_size;

        double invN = 1.0 / data.size();
        double m = 1.0;

        for (std::size_t i = 0; i < buckets; ++i)
            m *= std::pow(do_bucket(i * bucket_size, (i + 1) * bucket_size), invN);

        m *= std::pow(do_bucket(buckets * bucket_size, data.size()), invN);

        return std::pow(std::numeric_limits<double>::radix, ex * invN) * m;
    }

    
  • 2020-12-14 07:29

    Instead of using logarithms, which are very expensive, you can directly scale the results by powers of two.

    double geometric_mean(std::vector<double> const & data) {
      double huge = scalbn(1, 512);   // 2^512
      double tiny = scalbn(1, -512);  // 2^-512
      int scale = 0;                  // number of 2^512 factors scaled out
      double product = 1.0;
      for (auto x : data) {
        if (x >= huge) {
          x = scalbn(x, -512);
          scale++;
        } else if (x <= tiny) {
          x = scalbn(x, 512);
          scale--;
        }
        product *= x;
        if (product >= huge) {
          product = scalbn(product, -512);
          scale++;
        } else if (product <= tiny) {
          product = scalbn(product, 512);
          scale--;
        }
      }
      // true product = product * 2^(512*scale)
      return exp2((512.0 * scale + log2(product)) / data.size());
    }

    
  • 2020-12-14 07:43

    You may be able to accelerate this by multiplying numbers as in your original solution and only converting to logarithms every certain number of multiplications (depending on the size of your initial numbers).
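
    A branch-free sketch of that idea, converting to logarithms once per fixed-size block rather than testing the running product (the function name and the block size of 64 are illustrative choices, and assume no run of 64 consecutive values overflows or underflows on its own):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double geometric_mean_blocked(std::vector<double> const & data)
{
    const std::size_t block = 64;  // assumption: 64 in-range doubles cannot
                                   // overflow/underflow the running product
    double sum_log = 0.0;
    double product = 1.0;
    std::size_t n = 0;
    for (double x : data) {
        product *= x;
        if (++n == block) {        // one log call per 64 multiplications
            sum_log += std::log(product);
            product = 1.0;
            n = 0;
        }
    }
    return std::exp((sum_log + std::log(product)) / data.size());
}
```

    This trades the data-dependent branch of the product-testing approach for a predictable counter, at the cost of a weaker overflow guarantee.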

  • 2020-12-14 07:43

    Summing logs to compute products stably is perfectly fine, and rather efficient. If that is not enough, there are ways to get vectorized logarithms with a few SSE operations, and Intel MKL also provides vector math operations.

    To avoid overflow, a common technique is to divide every number by the maximum or minimum magnitude entry beforehand (or sum log differences to the log max or log min). You can also use buckets if the numbers vary a lot (e.g. sum the logs of small and large numbers separately). Note that typically neither of these is needed except for very large sets, since the log of a double is never huge (roughly between -700 and 700).

    Also, you need to keep track of the signs separately.

    Computing log x typically keeps the same number of significant digits as x, except when x is close to 1: use std::log1p if you need to compute prod(1 + x_n) with small x_n.

    Finally, if you have roundoff error problems when summing, you can use Kahan summation or variants.
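
    A sketch combining the points above: sum log|x| with Kahan compensation and track the sign separately (the function name is illustrative; the zero-handling is an assumption, and an odd count of negative factors has no real geometric mean, so the signed result is only meaningful for an even count):

```cpp
#include <cmath>
#include <vector>

double geometric_mean_signed(std::vector<double> const & data)
{
    int sign = 1;
    double sum = 0.0;   // Kahan-compensated sum of log|x|
    double comp = 0.0;  // running compensation (lost low-order bits)
    for (double x : data) {
        if (x == 0.0) return 0.0;   // any zero factor makes the mean zero
        if (x < 0.0) sign = -sign;  // track the sign separately
        double y = std::log(std::fabs(x)) - comp;
        double t = sum + y;
        comp = (t - sum) - y;       // algebraically zero; captures rounding error
        sum = t;
    }
    return sign * std::exp(sum / data.size());
}
```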

  • 2020-12-14 07:45

    I think I figured out a way to do it; it combines the two routines from the question, similar to Peter's idea. Here is an example:

    double geometric_mean(std::vector<double> const & data)
    {
        const double too_large = 1.e64;
        const double too_small = 1.e-64;
        double sum_log = 0.0;
        double product = 1.0;
        for (auto x : data) {
            product *= x;
            if (product > too_large || product < too_small) {
                sum_log += std::log(product);  // flush the partial product to log space
                product = 1;
            }
        }
        return std::exp((sum_log + std::log(product)) / data.size());
    }
    

    The bad news is: this comes with a branch. The good news: the branch predictor is likely to get this almost always right (the branch should only rarely be triggered).

    The branch could be avoided using Peter's idea of a constant number of terms in the product. The problem with that is that overflow/underflow may still occur within only a few terms, depending on the values.

  • 2020-12-14 07:48

    There is a simple idea that reduces computation and also helps prevent overflow: group the numbers, say at least two at a time, take the log of each group, and sum the results. If K is the geometric mean of a, b, c, d, e:

    log(abcde) = 5*log(K)
    
    log(ab) + log(cde) = 5*log(K)
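
    A minimal sketch of the grouping idea (function name illustrative), taking one log call per pair of factors instead of one per factor:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double geometric_mean_pairs(std::vector<double> const & data)
{
    double sum_log = 0.0;
    std::size_t i = 0;
    for (; i + 1 < data.size(); i += 2)
        sum_log += std::log(data[i] * data[i + 1]);  // one log call per pair
    if (i < data.size())                             // odd element left over
        sum_log += std::log(data[i]);
    return std::exp(sum_log / data.size());
}
```

    This halves the number of log calls; the pairwise products must themselves stay in range, so the groups cannot be made arbitrarily large.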
    