Minimize floating point error when adding multiple floating point variables

In my c++ app i have a vector of doubles in the range (0,1) and i have to calculate its total as accurately as possible. It feels like this issue should have been addressed before, but i cant find anything.

Obviously iterating through each item on the vector and doing sum+=vect[i] accumulates a significant error if the vector size is large and there are items which are significantly smaller then the others.

My current solution is this function:

double sumDoubles(vector<double> arg)// pass by copy
{
  sort(arg.rbegin(),arg.rend());  // sort in reverse order
  for(int i=1;i<=arg.size();i*=2)
    for(int j=0;j<arg.size()-i;j+=(2*i))
        arg[j]+=arg[j+i];
  return arg[0];
}

Basically it sorts the input in ascending order and calculates pairwise sums:

a+b+c+d+e+f+g+h=((a+b)+(c+d))+((e+f)+(g+h))

Like constructing a binary tree, but doing it in place. Sorting should ensure that at each step the two summands are of comparable magnitude.

The code above does perform better than a single loop with accumulative sum. However i am curious if it is possible to increase precision further while not degrading performance too much.

One of the standard ways of approaching this issue is the Kahan summation algorithm. This algorithm reduces your worst-case error to being dependent upon your floating point precision rather than growing proportional to the length of your vector (and does it in O(n) time, albeit with more calculations per iteration).

Kahan summation will likely outperform your current sumDoubles due to your sorting every call, and will additionally improve pairwise summation's error growth of O(log n) to O(1). This said, if the sort is unnecessary, pairwise summation will likely outperform Kahan summation (due to the additional per-iteration math involved) with what may be (for your circumstances) minimal error growth.

You're supposed to sort by absolute value. Currently, -100000000.0 sorts before 0.000000001. The idea of sorting is that you can then add terms of the same magnitude. Adding -100000000.0 and +100000000.0 if perfectly fine, so they should sort close together, but adding -100000000.0 and 0.000000001 gets you into problems with accuracy.

A second problem with your algorithm is that your evaluation order is pretty horrible. As you correctly note, you've got a tree structure, but even then you can evaluate the subexpressions in any order. The most efficient order for memory access would be to add [0] and [1], then [2] and [3], and then [0] and [2] before adding [4] and [5]. The reason is simple: you've got [0] and [2] still in cache, and their values will change. So don't write the intermediate values back to main memory just to read and modify them later. (In tree terms, that's DFS versus BFS evaluation)

For the same cache efficiency reasons, I'd also modify the location where you'd store the interim results. Sure, store [0] + [1] in [0]. But after that, [1] is not needed anymore. Store [2] + [3] in [1]. Now, add [0] and [1] again to get the sum of [0]..[3] stored in [0].

To summarize: add intermediate results together as soon as possible, to reduce the number of them, and store them in contiguous memory at the start of the array instead of scattered all around.

来源：https://stackoverflow.com/questions/19130042/minimize-floating-point-error-when-adding-multiple-floating-point-variables

标签

c++

floating-point

floating-accuracy