What is a good solution for calculating an average where the sum of all values exceeds a double's limits?


I have a requirement to calculate the average of a very large set of doubles (10^9 values). The sum of the values exceeds the upper bound of a double, so does anyone know a good way to calculate the average without overflowing?

17 answers
  • 2020-11-29 18:47

    A double can be divided by a power of 2 without loss of precision. So if your only problem is the absolute size of the sum, you could pre-scale your numbers before summing them. But with a dataset of this size, there is still the risk that you will hit a situation where you are adding small numbers to a large one, and the small numbers will end up being mostly (or completely) ignored.

    For instance, when you add 2.2e-20 to 9.0e20 the result is 9.0e20, because once the scales are adjusted so that the numbers can be added together, the smaller number becomes 0. Doubles can only hold about 17 significant digits, and you would need more than 40 digits to add these two numbers together without loss.
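
    A quick way to confirm that behaviour (a throwaway C# check, not from the original answer; assumes a console app with using System;):

    double big   = 9.0e20;
    double small = 2.2e-20;
    Console.WriteLine(big + small == big); // prints True: the smaller value is absorbed completely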

    So, depending on your data set and how many digits of precision you can afford to lose, you may need to do other things. Breaking the data into sets will help, but a better way to preserve precision might be to determine a rough average (you may already know this number), then subtract the rough average from each value before you sum. That way you are summing the distances from the average, so your sum should never get very large.

    Then you take the average delta and add it to your rough average to get the correct average. Keeping track of the min and max delta will also tell you how much precision you lost during the summing process. If you have lots of time and need a very accurate result, you can iterate, using the result as the new rough average.
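
    A minimal sketch of that idea in C# (illustrative only and untested; it assumes roughAverage is an estimate you already have, and that the individual deltas do not themselves overflow):

    static double AverageViaDeltas(IEnumerable<double> values, double roughAverage){
        double deltaSum = 0;
        double minDelta = double.MaxValue, maxDelta = double.MinValue;
        long count = 0;
    
        foreach(double value in values){
            double delta = value - roughAverage; // distances from the rough average stay small
            deltaSum += delta;
            if(delta < minDelta) minDelta = delta;
            if(delta > maxDelta) maxDelta = delta;
            count++;
        }
    
        // minDelta/maxDelta give a feel for how much precision the summation may have lost
        return roughAverage + deltaSum / count;
    }

    Iterating is then just a matter of calling this again with the returned value as the new rough average.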

  • 2020-11-29 18:47

    I posted an answer to a question spawned from this one, realizing afterwards that my answer is better suited to this question than to that one. I've reproduced it below. I notice though, that my answer is similar to a combination of Bozho's and Anon.'s.

    As the other question was tagged language-agnostic, I chose C# for the code sample I've included. Its relative ease of use and easy-to-follow syntax, along with its inclusion of a couple of features facilitating this routine (a DivRem function in the BCL, and support for iterator functions), as well as my own familiarity with it, made it a good choice for this problem. The OP here is interested in a Java solution, but since I'm not Java-fluent enough to write one effectively, it might be nice if someone could add a translation of this code to Java.


    Some of the mathematical solutions here are very good. Here's a simple technical solution.

    Use a larger data type. This breaks down into two possibilities:

    1. Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.

      I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.

    2. If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple (untested) example in C# follows:

    class BigMeanSet{
        List<uint> list = new List<uint>();
    
        public double GetAverage(IEnumerable<uint> values){
            list.Clear();
            list.Add(0);
    
            uint count = 0;
    
            foreach(uint value in values){
                Add(0, value);
                count++;
            }
    
            return DivideBy(count);
        }
    
        void Add(int listIndex, uint value){
        if((list[listIndex] += value) < value){ // then overflow has occurred
                if(list.Count == listIndex + 1)
                    list.Add(0);
                Add(listIndex + 1, 1);
            }
        }
    
        double DivideBy(uint count){
            const double shift = 4.0 * 1024 * 1024 * 1024;
    
            double rtn       = 0;
            long   remainder = 0;
    
            for(int i = list.Count - 1; i >= 0; i--){
                rtn *= shift;
                remainder <<= 32;
                rtn += Math.DivRem(remainder + list[i], count, out remainder);
            }
    
            rtn += remainder / (double)count;
    
            return rtn;
        }
    }
    

    Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.

    This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 2^32 - 1. If more elements are needed, then the count variable will need to be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.

    In terms of efficiency, it should be as fast as or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.

    If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.


    I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).

    I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 2^32 - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.

    I then tested with 100* large series, with random length between 10^5 and 10^9. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.

    *Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 10^9 takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.

    For those interested, my test code is below:

    static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
        for(uint i = lowerbound; i <= upperbound; i++)
            yield return i;
    }
    
    static void Test(){
        Console.BufferHeight = 1200;
        Random rnd = new Random();
    
        for(int i = 0; i < 1000; i++){
            uint[] numbers = new uint[rnd.Next(1, 1000)];
            for(int j = 0; j < numbers.Length; j++)
                numbers[j] = (uint)rnd.Next();
    
            double sum = 0;
            foreach(uint n in numbers)
                sum += n;
    
            double avg = sum / numbers.Length;
            double ans = new BigMeanSet().GetAverage(numbers);
    
            Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);
    
            if(avg != ans)
                Debugger.Break();
        }
    
        for(int i = 0; i < 100; i++){
            uint length     = (uint)rnd.Next(100000, 1000000001);
            uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
            uint upperbound = lowerbound + length;
    
            double avg = ((double)lowerbound + upperbound) / 2;
            double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));
    
            Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);
    
            if(avg != ans)
                Debugger.Break();
        }
    }
    
  • 2020-11-29 18:48

    IMHO, the most robust way of solving your problem is

    1. sort your set
    2. split it into groups of elements whose sum wouldn't overflow - since they are sorted, this is fast and easy
    3. do the sum in each group - and divide by the group size
    4. combine the group averages (possibly by calling this same algorithm recursively) - be aware that if the groups are not equally sized, you'll have to weight them by their size

    One nice thing about this approach is that it scales well if you have a really large number of elements to sum - and a large number of processors/machines to use to do the math
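
    A rough, untested C# sketch of steps 2-4 (groupSize is an arbitrary parameter here and must be chosen small enough that no single group's sum can overflow; step 1's sort is assumed to have been done already):

    static double GroupedAverage(double[] sorted, int groupSize){
        int n = sorted.Length;
        double weightedTotal = 0;
    
        for(int start = 0; start < n; start += groupSize){
            int size = Math.Min(groupSize, n - start);
    
            double groupSum = 0;
            for(int i = start; i < start + size; i++)
                groupSum += sorted[i];
    
            // weight each group's average by the fraction of the elements it contains,
            // so unequal (e.g. final) groups are handled correctly
            weightedTotal += (groupSum / size) * ((double)size / n);
        }
    
        return weightedTotal;
    }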

  • 2020-11-29 18:50

    You could take the average of averages of equal-sized subsets of numbers that don't exceed the limit.
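
    For what it's worth, a tiny untested sketch of that (equal-sized subsets are the special case of the weighted approach above; this version simply assumes values.Length is an exact multiple of subsetSize):

    static double AverageOfAverages(double[] values, int subsetSize){
        int subsetCount = values.Length / subsetSize; // assumes no remainder
        double result = 0;
    
        for(int s = 0; s < subsetCount; s++){
            double subsetSum = 0;
            for(int i = s * subsetSize; i < (s + 1) * subsetSize; i++)
                subsetSum += values[i];
    
            // equal-sized subsets mean each subset average gets the same weight
            result += (subsetSum / subsetSize) / subsetCount;
        }
    
        return result;
    }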

  • 2020-11-29 18:53

    Consider this:

    avg(n1)         : n1                               = a1
    avg(n1, n2)     : ((1/2)*n1)+((1/2)*n2)            = ((1/2)*a1)+((1/2)*n2) = a2
    avg(n1, n2, n3) : ((1/3)*n1)+((1/3)*n2)+((1/3)*n3) = ((2/3)*a2)+((1/3)*n3) = a3
    

    So for any set of doubles of arbitrary size, you could do this (this is in C#, but I'm pretty sure it could be easily translated to Java):

    static double GetAverage(IEnumerable<double> values) {
        int i = 0;
        double avg = 0.0;
        foreach (double value in values) {
            avg = (((double)i / (double)(i + 1)) * avg) + ((1.0 / (double)(i + 1)) * value);
            i++;
        }
    
        return avg;
    }
    

    Actually, this simplifies nicely into (already provided by martinus):

    static double GetAverage(IEnumerable<double> values) {
        int i = 1;
        double avg = 0.0;
        foreach (double value in values) {
            avg += (value - avg) / (i++);
        }
    
        return avg;
    }
    

    I wrote a quick test to try this function out against the more conventional method of summing up the values and dividing by the count (GetAverage_old). For my input I wrote this quick function to return as many random positive doubles as desired:

    static IEnumerable<double> GetRandomDoubles(long numValues, double maxValue, int seed) {
        Random r = new Random(seed);
        for (long i = 0L; i < numValues; i++)
            yield return r.NextDouble() * maxValue;
    
        yield break;
    }
    

    And here are the results of a few test trials:

    long N = 100L;
    double max = double.MaxValue * 0.01;
    
    IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
    double oldWay = GetAverage_old(doubles); // 1.00535024998431E+306
    double newWay = GetAverage(doubles); // 1.00535024998431E+306
    
    doubles = GetRandomDoubles(N, max, 1);
    oldWay = GetAverage_old(doubles); // 8.75142021696299E+305
    newWay = GetAverage(doubles); // 8.75142021696299E+305
    
    doubles = GetRandomDoubles(N, max, 2);
    oldWay = GetAverage_old(doubles); // 8.70772312848651E+305
    newWay = GetAverage(doubles); // 8.70772312848651E+305
    

    OK, but what about for 10^9 values?

    long N = 1000000000;
    double max = 100.0; // we start small, to verify accuracy
    
    IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
    double oldWay = GetAverage_old(doubles); // 49.9994879713857
    double newWay = GetAverage(doubles); // 49.9994879713868 -- pretty close
    
    max = double.MaxValue * 0.001; // now let's try something enormous
    
    doubles = GetRandomDoubles(N, max, 0);
    oldWay = GetAverage_old(doubles); // Infinity
    newWay = GetAverage(doubles); // 8.98837362725198E+305 -- no overflow
    

    Naturally, how acceptable this solution is will depend on your accuracy requirements. But it's worth considering.

  • 2020-11-29 18:58

    Please clarify the potential ranges of the values.

    Given that a double has a range ~= +/-10^308, and you're summing 10^9 values, the apparent range suggested in your question is values of the order of 10^299.

    That seems somewhat, well, unlikely...

    If your values really are that large, then with a normal double you've got only 17 significant decimal digits to play with, so you'll be throwing away about 280 digits worth of information before you can even think about averaging the values.

    I would also note (since no-one else has) that for any set of numbers X:

    mean(X) = sum(X[i] - c)  +  c
              -------------
                    N
    

    for any arbitrary constant c.

    In this particular problem, setting c = min(X) might dramatically reduce the risk of overflow during the summation.
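
    A small, untested C# illustration of that identity with c = min(X) (two passes over an array; note the shifted sum can still overflow if the spread of the values is itself enormous, it just becomes far less likely):

    static double MeanShiftedByMin(double[] x){
        double c = x[0];
        foreach(double v in x)      // first pass: c = min(X)
            if(v < c) c = v;
    
        double shiftedSum = 0;
        foreach(double v in x)      // second pass: sum the non-negative offsets X[i] - c
            shiftedSum += v - c;
    
        return shiftedSum / x.Length + c;   // mean(X) = sum(X[i] - c) / N + c
    }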

    May I humbly suggest that the problem statement is incomplete...?
