I am aware of a similar question, but I want to ask for people's opinions on my algorithm for summing floating-point numbers as accurately as possible at a practical cost.
If you are concerned about reducing the numerical error in your summation then you may be interested in Kahan's algorithm.
My guess is that your binary decomposition will work almost as well as Kahan summation.
Here is an example to illustrate it:
#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

// Replace the pair (*a, *b) with (rounded sum, rounding residue):
// *a receives *a + *b, and *b receives the part of the smaller operand
// that was lost in that addition.
void sumpair( float *a, float *b )
{
    volatile float sum = *a + *b;
    volatile float small = sum - std::max(*a, *b);
    volatile float residue = std::min(*a, *b) - small;
    *a = sum;
    *b = residue;
}
// Pairwise-sum the elements that are `stride` apart within a block of `size` elements.
void sumpairs( float *a, size_t size, size_t stride )
{
    if( size <= stride*2 ) {
        if( stride < size )
            sumpair( a, a+stride );
    } else {
        size_t half = 1;
        while( half*2 < size ) half *= 2;
        sumpairs( a, half, stride );
        sumpairs( a+half, size-half, stride );
    }
}
// Repeated pairwise passes with doubling stride: afterwards a[0] holds the
// pairwise sum and the other slots hold the rounding residues.
void sumpairwise( float *a, size_t size )
{
    for( size_t stride = 1; stride < size; stride *= 2 )
        sumpairs( a, size, stride );
}
int main()
{
    float data[10000000];
    size_t size = sizeof data / sizeof data[0];
    for( size_t i = 0; i < size; i++ )
        data[i] = ((1<<30)*-1.0+random())/(1.0+random());

    // Naive single-precision summation, for reference.
    float naive = 0;
    for( size_t i = 0; i < size; i++ ) naive += data[i];
    printf("naive sum=%.8g\n", naive);

    // Double-precision accumulator, for reference.
    double dprec = 0;
    for( size_t i = 0; i < size; i++ ) dprec += data[i];
    printf("dble prec sum=%.8g\n", (float)dprec);

    // 1st approximation: pairwise sum; result in data[0], residues elsewhere.
    sumpairwise( data, size );
    printf("1st approx sum=%.8g\n", data[0]);

    // 2nd approximation: sum the residues and fold them back into data[0].
    sumpairwise( data+1, size-1 );
    sumpairwise( data, 2 );
    printf("2nd approx sum=%.8g\n", data[0]);

    // 3rd approximation: one more pass over the remaining residues.
    sumpairwise( data+2, size-2 );
    sumpairwise( data+1, 2 );
    sumpairwise( data, 2 );
    printf("3rd approx sum=%.8g\n", data[0]);
    return 0;
}
I declared my operands volatile and compiled with -ffloat-store to avoid extra precision on the x86 architecture:
g++ -ffloat-store -Wl,-stack_size,0x20000000 test_sum.c
and I get (0.03125 is 1 ULP):
naive sum=-373226.25
dble prec sum=-373223.03
1st approx sum=-373223
2nd approx sum=-373223.06
3rd approx sum=-373223.06
This deserves a little explanation. After the first sumpairwise pass, data[0] holds the pairwise sum and the remaining slots hold the rounding residues; each further approximation sums those residues and folds the correction back into data[0], refining the result to within about 1 ULP of the double-precision reference.
The partial sums are generated in increasing order, so they will be put into the heap in increasing order; that means you can replace the heap with two queues, one holding the sorted inputs and one holding the generated sums. This runs in O(n) once the data is sorted, and in O(n) overall if the numbers are pre-sorted and the sorting algorithm detects that. The following pseudocode produces the same results as your algorithm (a C++ sketch follows it):
Queue<float> leaves = sort(arguments[0]).toQueue();
Queue<float> nodes = new Queue();
popAny = #(){
    if(leaves.length == 0) return nodes.pop();
    else if(nodes.length == 0) return leaves.pop();
    else if(leaves.top() > nodes.top()) return nodes.pop();
    else return leaves.pop();
}
while(leaves.length > 0 || nodes.length > 1) nodes.push(popAny() + popAny());
return nodes.pop();
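Here is a rough, runnable C++ rendering of the same idea, in case the pseudocode is unclear. The function name, the use of std::deque, and the empty/single-element guards are my own additions:

#include <algorithm>
#include <deque>
#include <vector>

// Two-queue variant: `leaves` holds the sorted inputs, `nodes` holds the
// partial sums, which are generated in increasing order.
float two_queue_sum( std::vector<float> values )
{
    if( values.empty() ) return 0.0f;
    if( values.size() == 1 ) return values[0];
    std::sort( values.begin(), values.end() );
    std::deque<float> leaves( values.begin(), values.end() );
    std::deque<float> nodes;

    // Pop whichever queue currently has the smaller front element.
    auto popAny = [&]() {
        float v;
        if( nodes.empty() || (!leaves.empty() && leaves.front() <= nodes.front()) ) {
            v = leaves.front(); leaves.pop_front();
        } else {
            v = nodes.front(); nodes.pop_front();
        }
        return v;
    };

    while( !leaves.empty() || nodes.size() > 1 )
        nodes.push_back( popAny() + popAny() );
    return nodes.front();
}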
Kahan's summation algorithm is significantly more precise than straightforward summation, and it runs in O(n) (somewhere between 1 and 4 times slower than straightforward summation, depending on how fast floating-point arithmetic is compared to data access; definitely less than 4 times slower on desktop hardware, and without any shuffling around of data).
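For reference, here is a minimal sketch of Kahan summation over an array of floats (the function name and signature are my own; also note the compiler must not be allowed to reassociate these operations, so avoid -ffast-math):

#include <stddef.h>

// Kahan (compensated) summation: c carries the rounding error of the
// previous additions and is fed back into the next term.
float kahan_sum( const float *data, size_t size )
{
    float sum = 0.0f, c = 0.0f;
    for( size_t i = 0; i < size; i++ ) {
        float y = data[i] - c;   // apply the compensation to the new term
        float t = sum + y;       // low-order bits of y may be lost here...
        c = (t - sum) - y;       // ...and are recovered into c
        sum = t;
    }
    return sum;
}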
Alternately, if you are using the usual x86 hardware, and if your compiler allows access to the 80-bit long double type, simply use the straightforward summation algorithm with an accumulator of type long double. Only convert the result to double at the very end.
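As a sketch (the function name is mine):

#include <stddef.h>

// Straightforward summation with an 80-bit extended-precision accumulator.
double long_double_sum( const double *data, size_t size )
{
    long double acc = 0.0L;
    for( size_t i = 0; i < size; i++ )
        acc += data[i];
    return (double)acc;   // convert back to double only at the very end
}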
If you really need a lot of precision, you can combine the above two solutions by using long double for the variables c, y, t, sum in Kahan's summation algorithm.
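A sketch of that combination, mirroring the Kahan loop above (again, the function name is my own choice):

#include <stddef.h>

// Kahan summation with an extended-precision accumulator and compensation.
double kahan_long_double_sum( const double *data, size_t size )
{
    long double sum = 0.0L, c = 0.0L;
    for( size_t i = 0; i < size; i++ ) {
        long double y = data[i] - c;
        long double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return (double)sum;
}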