The precision of a large floating point sum

问题

I am trying to sum a sorted array of positive decreasing floating points. I have seen that the best way to sum them is to start adding up numbers from lowest to highest. I wrote this code to have an example of that, however, the sum that starts on the highest number is more precise. Why? (of course, the sum 1/k^2 should be f=1.644934066848226).

#include <stdio.h>
#include <math.h>

int main() {

    double sum = 0;
    int n;
    int e = 0;
    double r = 0;
    double f = 1.644934066848226;
    double x, y, c, b;
    double sum2 = 0;

    printf("introduce n\n");
    scanf("%d", &n);

    double terms[n];

    y = 1;

    while (e < n) {
        x = 1 / ((y) * (y));
        terms[e] = x;
        sum = sum + x;
        y++;
        e++;
    }

    y = y - 1;
    e = e - 1;

    while (e != -1) {
        b = 1 / ((y) * (y));
        sum2 = sum2 + b;
        e--;
        y--;
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", f - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", f - sum2);
    return 0;
}

回答1:

Your code creates an array double terms[n]; on the stack, and this puts a hard limit on the number of iterations that can be performed before your program crashes.

But you don't even fetch anything from this array, so there's no reason to have it there at all. I altered your code to get rid of terms[]:

#include <stdio.h>

int main() {

    double pi2over6 = 1.644934066848226;
    double sum = 0.0, sum2 = 0.0;
    double y;
    int i, n;

    printf("Enter number of iterations:\n");
    scanf("%d", &n);

    y = 1.0;

    for (i = 0; i < n; i++) {
        sum += 1.0 / (y * y);
        y += 1.0;
    }

    for (i = 0; i < n; i++) {
        y -= 1.0;
        sum2 += 1.0 / (y * y);
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", pi2over6 - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", pi2over6 - sum2);
    return 0;

}

When this is run with, say, a billion iterations, the smallest-first approach is considerably more accurate:

Enter number of iterations:
1000000000
sum from biggest to smallest is 1.6449340578345750
and its error 0.0000000090136509
sum from smallest to biggest is 1.6449340658482263
and its error 0.0000000009999996

回答2:

When you add two floating-point numbers with different orders of magnitude, the lower order bits of the smallest number are lost.

When you sum from smallest to largest, the partial sums grow like Σ1/k² for k from N to n, i.e. approximately 1/n-1/N (in blue), to be compared to 1/n².

When you sum from largest to smallest, the partial sums grow like Σ1/k² for k from n to N, which is about π²/6-1/n (in green) to be compared to 1/n².

It is clear that the second case results in many more bit losses.

来源：https://stackoverflow.com/questions/49090286/the-precision-of-a-large-floating-point-sum

标签

floating-point

precision

numerical-methods