Floating-point arithmetic is neither associative nor distributive on processors. So,
(a + b) + c
is not equal to a + (b + c)
and (a + b) * c
is not equal to a * c + b * c.
Edit: I've removed my old answer since I seem to have misunderstood OP's question. If you want to see it you can read the edit history.
I think the ideal solution would be to switch to having a separate accumulator for each thread. This avoids all locking, which should make a drastic difference to performance. You can simply sum the accumulators at the end of the whole operation.
Alternatively, if you insist on using a single accumulator, one solution is to use "fixed-point" rather than floating-point arithmetic. This can be done with floating-point types by including a giant "bias" term in your accumulator to lock the exponent at a fixed value. For example, if you know the accumulator will never exceed 2^32, you can start the accumulator at 0x1p32. This locks you at 32 bits of precision to the left of the radix point and 20 bits of fractional precision (assuming double). If that's not enough precision, you could use a smaller bias (assuming the accumulator will not grow too large) or switch to long double. If long double is the 80-bit extended format, a bias of 2^32 would give 31 bits of fractional precision.
Then, whenever you want to actually "use" the value of the accumulator, simply subtract out the bias term.