Precise sum of floating point numbers

花落未央 2020-12-06 05:16

I am aware of a similar question, but I would like people's opinions on my algorithm for summing floating-point numbers as accurately as possible at a practical cost.

4 answers
  • 2020-12-06 05:28

    If you are concerned about reducing the numerical error in your summation then you may be interested in Kahan's algorithm.
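
As a rough illustration of Kahan's algorithm, here is a minimal single-precision sketch. The function names `kahan_sum` and `naive_sum` are made up for this example, and it assumes the compiler does not reassociate floating-point operations (for instance, no `-ffast-math`), since that would optimize the compensation away.

```cpp
#include <cstddef>

// Compensated (Kahan) summation: c carries the low-order bits that were
// lost when the previous term was added to sum.
float kahan_sum(const float *data, std::size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;                 // running compensation
    for (std::size_t i = 0; i < n; ++i) {
        float y = data[i] - c;      // apply the correction to the next term
        float t = sum + y;          // low-order bits of y may be lost here
        c = (t - sum) - y;          // recover what was lost (with opposite sign)
        sum = t;
    }
    return sum;
}

// Plain left-to-right summation, for comparison.
float naive_sum(const float *data, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}
```

When summing 1.0f followed by a thousand copies of 1e-8f, the plain loop stays at 1.0f because every term is below half an ULP of the running total, while the compensated loop ends up close to 1.00001.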

  • 2020-12-06 05:32

    My guess is that your binary decomposition will work almost as well as Kahan summation.

    Here is an example to illustrate it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <algorithm>
    
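    // Adds *a and *b, leaving the rounded sum in *a and the rounding error
    // (residue) in *b.
    // Caveat: std::max/std::min compare signed values, not magnitudes, so the
    // residue is only guaranteed exact when the larger value is also the
    // larger one in absolute value.
    void sumpair( float *a, float *b)
    {
        volatile float sum = *a + *b;
        volatile float small = sum - std::max(*a,*b);
        volatile float residue = std::min(*a,*b) - small;
        *a = sum;
        *b = residue;
    }
    
    // One pass at the given stride: the range is split into power-of-two
    // blocks, and in each block of at most 2*stride elements, element
    // [stride] is folded into element [0].
    void sumpairs( float *a,size_t size, size_t stride)
    {
        if (size <= stride*2 ) {
            if( stride<size )
                sumpair(a,a+stride);   // was a+i, but no i is defined here
        } else {
            size_t half = 1;
            while(half*2 < size) half*=2;
            sumpairs( a , half , stride );
            sumpairs( a+half , size-half , stride );
        }
    }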
    
    void sumpairwise( float *a,size_t size )
    {
        for(size_t stride=1;stride<size;stride*=2)
            sumpairs(a,size,stride);
    }
    
    int main()
    {
        float data[10000000];
        size_t size= sizeof data/sizeof data[0];
        for(size_t i=0;i<size;i++) data[i]=((1<<30)*-1.0+random())/(1.0+random());
    
        float naive=0;
        for(size_t i=0;i<size;i++) naive+=data[i];
        printf("naive      sum=%.8g\n",naive);
    
        double dprec=0;
        for(size_t i=0;i<size;i++) dprec+=data[i];
        printf("dble prec  sum=%.8g\n",(float)dprec);
    
        sumpairwise( data , size );
        printf("1st approx sum=%.8g\n",data[0]);
        sumpairwise( data+1 , size-1);
        sumpairwise( data , 2 );
        printf("2nd approx sum=%.8g\n",data[0]);
        sumpairwise( data+2 , size-2);
        sumpairwise( data+1 , 2 );
        sumpairwise( data , 2 );
        printf("3rd approx sum=%.8g\n",data[0]);
        return 0;
    }
    

    I declared my operands volatile and compiled with -ffloat-store to avoid extra precision on the x86 architecture:

    g++  -ffloat-store  -Wl,-stack_size,0x20000000 test_sum.c
    

    and I get the following (0.03125 is 1 ULP):

    naive      sum=-373226.25
    dble prec  sum=-373223.03
    1st approx sum=-373223
    2nd approx sum=-373223.06
    3rd approx sum=-373223.06
    

    This deserves a little explanation.

    • I first display the naive summation.
    • Then the double-precision summation (Kahan is roughly equivalent to that).
    • The 1st approximation is the same as your binary decomposition, except that I store the sum in data[0] and take care to store the residues. This way, the exact sum of data is unchanged by the summation.
    • This enables me to approximate the error by summing the residues in a 2nd iteration in order to correct the 1st iteration (equivalent to applying Kahan to the binary summation).
    • By iterating further I can refine the result further, and we see it converge.
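
The property the passes rely on, namely that a pair (sum, residue) carries exactly the same total as the two inputs, can be checked in isolation. The sketch below is not the sumpair routine from this answer: it uses Knuth's branch-free TwoSum, which needs no ordering of the operands, and the names `two_sum` and `SumAndResidue` are invented for the illustration. It assumes no overflow and no reassociating optimizations.

```cpp
struct SumAndResidue {
    float sum;      // a + b rounded to float
    float residue;  // the rounding error, so that a + b == sum + residue
};

// Knuth's TwoSum. The volatile temporaries mirror the answer's approach to
// keeping intermediate results in single precision.
SumAndResidue two_sum(float a, float b)
{
    volatile float s  = a + b;
    volatile float bv = s - a;    // the part of b that made it into s
    volatile float av = s - bv;   // the part of a that made it into s
    volatile float eb = b - bv;   // what was lost from b
    volatile float ea = a - av;   // what was lost from a
    SumAndResidue out;
    out.sum = s;
    out.residue = ea + eb;
    return out;
}
```

For example, two_sum(1e8f, 1.0f) returns a sum of 1e8 and a residue of 1: the spacing between floats near 1e8 is 8, so the 1 cannot be represented in the sum, but it is kept in the residue instead of being discarded.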
  • 2020-12-06 05:39

    The elements will be put into the heap in increasing order, so you can use two queues instead. This runs in O(n) time if the numbers are pre-sorted.

    This pseudocode produces the same results as your algorithm and runs in O(n) if the input is pre-sorted and the sorting algorithm detects that:

    Queue<float> leaves = sort(arguments[0]).toQueue();
    Queue<float> nodes = new Queue();
    
    popAny = #(){
           if(leaves.length == 0) return nodes.pop();
      else if(nodes.length == 0) return leaves.pop();
      else if(leaves.top() > nodes.top()) return nodes.pop();
      else return leaves.pop();
    }
    
    while(leaves.length>0 || nodes.length>1) nodes.push(popAny()+popAny());
    
    return nodes.pop();
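
A possible C++ rendering of the pseudocode above is sketched here. The name `two_queue_sum` is hypothetical, the sorted input vector itself plays the role of the leaves queue, and the sketch assumes non-negative inputs so that partial sums are produced in increasing order. It also returns 0 for an empty input and the element itself for a single-element input, which the pseudocode does not cover.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Repeatedly adds the two smallest remaining values.
float two_queue_sum(std::vector<float> values)
{
    if (values.empty()) return 0.0f;
    std::sort(values.begin(), values.end());

    std::size_t next_leaf = 0;   // front of the "leaves" queue
    std::queue<float> nodes;     // partial sums, in the order produced

    // Pop the smaller of the two queue fronts (leaves win ties).
    auto pop_any = [&]() -> float {
        bool take_node = next_leaf == values.size() ||
                         (!nodes.empty() && values[next_leaf] > nodes.front());
        if (take_node) {
            float x = nodes.front();
            nodes.pop();
            return x;
        }
        return values[next_leaf++];
    };

    // Keep combining until a single value is left.
    while ((values.size() - next_leaf) + nodes.size() > 1) {
        float x = pop_any();
        float y = pop_any();
        nodes.push(x + y);
    }
    return pop_any();
}
```

With eight copies of 1.0f and one 1e8f, the ones are combined into 8 before they meet the large value, so the result is 100000008 rather than the 1e8 that a left-to-right loop starting from the large value would give.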
    
  • 2020-12-06 05:47

    Kahan's summation algorithm is significantly more precise than straightforward summation, and it runs in O(n). It is somewhere between 1 and 4 times slower than straightforward summation, depending on how fast floating-point arithmetic is compared to data access. On desktop hardware it is definitely less than 4 times slower, and it does not shuffle any data around.

    Alternatively, if you are using the usual x86 hardware and your compiler allows access to the 80-bit long double type, simply use the straightforward summation algorithm with an accumulator of type long double. Only convert the result to double at the very end.

    If you really need a lot of precision, you can combine the above two solutions by using long double for the variables c, y, t and sum in Kahan's summation algorithm.
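
Both suggestions can be sketched as follows. The function names are made up for this example, and whether long double is actually wider than double depends on the compiler and target (on some platforms the two types are identical), so the sketch exposes a flag derived from std::numeric_limits to check that assumption.

```cpp
#include <cstddef>
#include <limits>

// True when long double carries more significand bits than double.
const bool long_double_is_wider =
    std::numeric_limits<long double>::digits >
    std::numeric_limits<double>::digits;

// Straightforward summation with a long double accumulator; the result is
// rounded to double only once, at the very end.
double sum_long_double(const double *data, std::size_t n)
{
    long double acc = 0.0L;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return static_cast<double>(acc);
}

// Kahan summation with c, y, t and sum held in long double.
double kahan_sum_long_double(const double *data, std::size_t n)
{
    long double sum = 0.0L;
    long double c = 0.0L;               // running compensation
    for (std::size_t i = 0; i < n; ++i) {
        long double y = data[i] - c;
        long double t = sum + y;
        c = (t - sum) - y;              // low-order bits lost in sum + y
        sum = t;
    }
    return static_cast<double>(sum);
}
```

If long_double_is_wider is false, the first function behaves like plain double summation, and only the compensated version still improves on it.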
