I have a requirement to calculate the average of a very large set of doubles (10^9 values). The sum of the values exceeds the upper bound of a double, so does anyone know a
Apart from the better approaches already suggested, you can use BigDecimal for the calculation. (Bear in mind that it is immutable, so every operation returns a new object.)
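A minimal sketch of what that could look like. The class and method names here are made up for illustration, and the choice of MathContext.DECIMAL128 for the final division is my assumption, not something the answer above prescribes. Note how the result of add() has to be reassigned because BigDecimal is immutable.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical helper, not part of any library.
final class BigDecimalMean {
    static double mean(double[] values) {
        BigDecimal sum = BigDecimal.ZERO;
        for (double v : values) {
            // add() does not modify sum; it returns a new BigDecimal.
            // new BigDecimal(double) converts the binary value exactly
            // and throws NumberFormatException for NaN or infinities.
            sum = sum.add(new BigDecimal(v));
        }
        // A MathContext bounds the precision of the quotient (assumed: 34 digits).
        BigDecimal count = BigDecimal.valueOf(values.length);
        return sum.divide(count, MathContext.DECIMAL128).doubleValue();
    }
}
```

The sum itself can never overflow here, but expect this to be far slower than plain double arithmetic for 10^9 values.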
First of all, make yourself familiar with the internal representation of double
values. Wikipedia should be a good starting point.
Then, consider that doubles are stored as a significand multiplied by a power of two. The limit on the largest double value comes from the upper limit of the exponent, not from a limit on the significand! So you can divide all large input numbers by a large enough power of two. This only changes the exponent and is exact for all numbers of large enough magnitude. You can re-multiply the result by the factor to check whether you lost precision in the division.
Here is an algorithm (despite its name, the method returns the average):
public static double sum(double[] numbers) {
    double eachSum = 0.0, tempSum = 0.0;
    double factor = Math.pow(2.0, 30); // about as large as 10^9
    for (double each : numbers) {
        double temp = each / factor;
        if (temp * factor != each) {
            // scaling lost precision (the value is tiny), so sum it unscaled
            eachSum += each;
        } else {
            tempSum += temp;
        }
    }
    return (tempSum / numbers.length) * factor + (eachSum / numbers.length);
}
And don't be worried by the additional division and multiplication. Because the factor is a power of two, they only adjust the exponent and leave the significand untouched, so they should be cheap (for comparison, imagine adding and removing digits at the end of a decimal number).
PS: In addition, you may want to use Kahan summation to improve the precision. Kahan summation reduces the loss of precision when very large and very small numbers are summed up.
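For reference, a minimal sketch of Kahan (compensated) summation. The class name is made up for illustration. It only improves precision and does nothing about overflow, so for this problem it would have to be combined with the scaling shown above.

```java
// Hypothetical helper illustrating compensated summation.
final class KahanSum {
    static double sum(double[] values) {
        double sum = 0.0;
        double c = 0.0; // running compensation for lost low-order bits
        for (double v : values) {
            double y = v - c;   // apply the correction from the previous step
            double t = sum + y; // low-order bits of y may be lost here
            c = (t - sum) - y;  // recover what was lost (with opposite sign)
            sum = t;
        }
        return sum;
    }
}
```

With the input 1.0 followed by ten copies of 1e-16, a plain loop returns exactly 1.0 because each small addend is rounded away, while the compensated version ends up close to 1.000000000000001.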
To keep the logic simple, with performance that is not the best but acceptable, I recommend using BigDecimal together with a primitive double. The concept is very simple: you sum values in the primitive, and whenever the next addition would overflow (in the positive or the negative direction), you move the accumulated value into the BigDecimal and reset the primitive for the next partial sum. One more thing you should be aware of: when you construct a BigDecimal, you ought to always use a String instead of a double.
import java.math.BigDecimal;
import java.math.MathContext;

BigDecimal average(double[] values) {
    BigDecimal totalSum = BigDecimal.ZERO;
    double tempSum = 0.00;
    for (double value : values) {
        if (isOutOfRange(tempSum, value)) {
            // flush the partial sum before the next addition overflows
            totalSum = sum(totalSum, tempSum);
            tempSum = 0.00;
        }
        tempSum += value;
    }
    totalSum = sum(totalSum, tempSum);
    BigDecimal count = new BigDecimal(values.length);
    // divide() without a MathContext throws ArithmeticException when the
    // quotient has a non-terminating decimal expansion
    return totalSum.divide(count, MathContext.DECIMAL128);
}

BigDecimal sum(BigDecimal val1, double val2) {
    BigDecimal val = new BigDecimal(String.valueOf(val2));
    return val1.add(val);
}

boolean isOutOfRange(double sum, double value) {
    // sum + value > max cannot be tested directly when both are positive,
    // because the addition itself overflows, so test value > max - sum
    if (sum >= 0.00 && value > Double.MAX_VALUE - sum) {
        return true;
    }
    // likewise, when both are negative, test value < -max - sum
    // (Double.MIN_VALUE is the smallest positive double, so use -MAX_VALUE)
    if (sum < 0.00 && value < -Double.MAX_VALUE - sum) {
        return true;
    }
    return false;
}
With this approach, every time the partial sum is about to overflow, we move it into the bigger variable. This solution might slow things down a bit because of the BigDecimal arithmetic, but it guarantees that the calculation does not overflow at runtime.
So that I don't have to repeat myself, let me state up front that I am assuming the list of numbers is normally distributed, and that you can sum many numbers before you overflow. The technique still works for non-normal distributions, but some things will not meet the expectations I describe below.
--
Sum up a sub-series, keeping track of how many numbers you eat, until you approach the overflow, then take the average. This will give you an average a0, and count n0. Repeat until you exhaust the list. Now you should have many ai, ni.
Each ai and ni should be relatively close, with the possible exception of the last bite of the list. You can mitigate that by under-biting near the end of the list.
You can combine any subset of these ai, ni by picking any ni in the subset (call it np) and dividing all the ni in the subset by that value. The max size of the subsets to combine is the roughly constant value of the n's.
The ni/np should be close to one. Now sum ni/np * ai and multiply by np/(sum ni), keeping track of sum ni. This gives you a new ni, ai combination, if you need to repeat the procedure.
If you will need to repeat (i.e., the number of ai, ni pairs is much larger than the typical ni), try to keep relative n sizes constant by combining all the averages at one n level first, then combining at the next level, and so on.
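A rough sketch of the idea, under a few assumptions of mine: the class name is made up, a chunk is closed when the next addition would overflow to infinity (rather than slightly before), all inputs are finite, and the partial averages are combined in a single pass with weights ni / N instead of the multi-level ni/np scheme described above. Algebraically both give sum(ni * ai) / sum(ni).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: average via partial (chunk) averages.
final class ChunkedMean {
    static double mean(double[] values) {
        List<double[]> parts = new ArrayList<>(); // each entry is {a_i, n_i}
        double sum = 0.0;
        long n = 0;
        for (double v : values) {
            double next = sum + v;
            if (Double.isInfinite(next) && n > 0) {
                // Adding v would overflow: close the current chunk first.
                parts.add(new double[] {sum / n, n});
                sum = v;
                n = 1;
            } else {
                sum = next;
                n++;
            }
        }
        if (n > 0) {
            parts.add(new double[] {sum / n, n});
        }
        // Weighted combination: each a_i contributes with weight n_i / N.
        double total = values.length;
        double mean = 0.0;
        for (double[] p : parts) {
            mean += p[0] * (p[1] / total);
        }
        return mean;
    }
}
```

Because every weight is at most one and the weights add up to one, the combined value stays within the range of the partial averages as long as they share a sign.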
A random sample of a small subset of the full dataset will often give a 'good enough' answer. You obviously have to make this determination yourself based on system requirements. The sample size can be remarkably small and still yield reasonably good answers. It can be chosen adaptively by calculating the average of an increasing number of randomly chosen samples: the average will converge to within some interval.
Sampling not only addresses the double overflow concern, but is much, much faster. Not applicable for all problems, but certainly useful for many problems.
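One way this could be sketched, with everything here being my own assumption rather than a prescribed method: values are sampled with replacement in batches, the estimate is kept as a running mean so no large sum is ever formed, and sampling stops once a whole batch changes the estimate by less than a relative tolerance. The class name, the parameters and the stopping rule are all hypothetical, and the stopping rule is a heuristic, not a statistical guarantee.

```java
import java.util.Random;

// Hypothetical helper: adaptive estimate of the mean by random sampling.
final class SampledMean {
    static double estimate(double[] values, int batch, double tolerance, long seed) {
        Random rng = new Random(seed);
        double mean = 0.0;
        long n = 0;
        // Never draw more samples than there are values.
        while (n < values.length) {
            double before = mean;
            for (int i = 0; i < batch; i++) {
                double x = values[rng.nextInt(values.length)];
                n++;
                mean += (x - mean) / n; // running mean of the samples so far
            }
            // Stop when a full batch barely moved the estimate.
            if (n > batch && Math.abs(mean - before) <= tolerance * Math.abs(mean)) {
                break;
            }
        }
        return mean;
    }
}
```

A call such as SampledMean.estimate(data, 1000, 1e-6, 42L) would then trade accuracy for speed; whether that trade is acceptable depends on the requirements.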
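You can calculate the mean iteratively. This algorithm is simple and fast, it processes each value just once, and the running average never gets larger in magnitude than the largest value in the set, so you won't get an overflow (as long as the difference x - avg itself stays in range, which can only fail when values of opposite sign are both near the limits of double).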
double mean(double[] ary) {
    double avg = 0;
    int t = 1; // number of values seen, including the current one
    for (double x : ary) {
        avg += (x - avg) / t;
        ++t;
    }
    return avg;
}
Inside the loop, avg is always the average of all values processed so far. In other words, if all the values are finite, you should not get an overflow.