Statistics with a large amount of data in C++ or Scilab or Octave or R

Submitted by 放荡痞女 on 2020-01-07 04:20:10

Question


I recently needed to calculate the mean and standard deviation of a large number (about 800,000,000) of doubles. Since a double takes 8 bytes, reading all of them into RAM would take about 6 GB. I think I could use a divide-and-conquer approach in C++ or another language, but that seems tedious. Is there a way to do this all at once with a high-level language like R, Scilab or Octave? Thanks.


Answer 1:


It sounds like you could use R-Grid or Hadoop to good advantage.

You realize, of course, that it's easy to calculate both the mean and standard deviation without having to read all the values into memory. Just keep a running total, like this Java class does. All you need is the total sum, the total sum of squares, and the number of points. I keep the min and max for free.

This also makes clear how map-reduce would work. You'd create several instances of Statistics and let each one keep the sum, sum of squares, and number of points for its portion of the 800M points. The reduce step then adds those totals together and uses the same formulas to get the final result.

import org.apache.commons.lang3.StringUtils;

import java.util.Collection;

/**
 * Statistics accumulates simple statistics for a given quantity "on the fly" - no array needed.
 * Note: overflow of the sum of squares is not detected automatically; call reset() to start over.
 * @author mduffy
 * @since 9/19/12 8:16 AM
 */
public class Statistics {
    private String quantityName;
    private int numValues;
    private double x;
    private double xsq;
    private double xmin;
    private double xmax;

    /**
     * Constructor
     */
    public Statistics() {
        this(null);
    }

    /**
     * Constructor
     * @param quantityName to describe the quantity (e.g. "heap size")
     */
    public Statistics(String quantityName) {
        this.quantityName = (StringUtils.isBlank(quantityName) ? "x" : quantityName);
        this.reset();
    }

    /**
     * Reset the object in the event of overflow by the sum of squares
     */
    public synchronized void reset() {
        this.numValues = 0;
        this.x = 0.0;
        this.xsq = 0.0;
        this.xmin = Double.MAX_VALUE;
        this.xmax = -Double.MAX_VALUE;
    }

    /**
     * Add a List of values
     * @param values to add to the statistics
     */
    public synchronized void addAll(Collection<Double> values) {
        for (Double value : values) {
            add(value);
        }
    }

    /**
     * Add an array of values
     * @param values to add to the statistics
     */
    public synchronized void addAll(double [] values) {
        for (double value : values) {
            add(value);
        }
    }

    /**
     * Add a value to current statistics
     * @param value to add for this quantity
     */
    public synchronized void add(double value) {
        double vsq = value*value;
        ++this.numValues;
        this.x += value;
        this.xsq += vsq; // a double overflows to Infinity; Double.isInfinite(this.xsq) would detect it
        if (value < this.xmin) {
            this.xmin = value;
        }
        if (value > this.xmax) {
            this.xmax = value;
        }
    }

    /**
     * Get the current value of the mean or average
     * @return mean or average if one or more values have been added or zero for no values added
     */
    public synchronized double getMean() {
        double mean = 0.0;
        if (this.numValues > 0) {
            mean = this.x/this.numValues;
        }
        return mean;
    }

    /**
     * Get the current min value
     * @return current min value or Double.MAX_VALUE if no values added
     */
    public synchronized double getMin() {
        return this.xmin;
    }

    /**
     * Get the current max value
     * @return current max value or -Double.MAX_VALUE if no values added
     */
    public synchronized double getMax() {
        return this.xmax;
    }

    /**
     * Get the current standard deviation
     * @return standard deviation for (N-1) dof or zero if one or fewer values added
     */
    public synchronized double getStdDev() {
        double stdDev = 0.0;
        if (this.numValues > 1) {
            stdDev = Math.sqrt((this.xsq-this.x*this.x/this.numValues)/(this.numValues-1));
        }
        return stdDev;
    }

    /**
     * Get the current number of values added
     * @return current number of values added, or zero after reset()
     */
    public synchronized int getNumValues() {
        return this.numValues;
    }

    @Override
    public String toString() {
        final StringBuilder sb = new StringBuilder();
        sb.append("Statistics");
        sb.append("{quantityName='").append(quantityName).append('\'');
        sb.append(", numValues=").append(numValues);
        sb.append(", xmin=").append(xmin);
        sb.append(", mean=").append(this.getMean());
        sb.append(", std dev=").append(this.getStdDev());
        sb.append(", xmax=").append(xmax);
        sb.append('}');
        return sb.toString();
    }
}

And here's the JUnit test to prove that it's working:

import org.junit.Assert;
import org.junit.Test;

import java.util.Arrays;
import java.util.List;

/**
 * StatisticsTest
 * @author mduffy
 * @since 9/19/12 11:21 AM
 */
public class StatisticsTest {
    public static final double TOLERANCE = 1.0e-4;

    @Test
    public void testAddAll() {
        // The test uses a full array, but it's obvious that you could read them from a file one at a time and process until you're done.
        List<Double> values = Arrays.asList( 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 );
        Statistics stats = new Statistics();
        stats.addAll(values);
        Assert.assertEquals(8, stats.getNumValues());
        Assert.assertEquals(2.0, stats.getMin(), TOLERANCE);
        Assert.assertEquals(9.0, stats.getMax(), TOLERANCE);
        Assert.assertEquals(5.0, stats.getMean(), TOLERANCE);
        Assert.assertEquals(2.138089935299395, stats.getStdDev(), TOLERANCE);
    }
}
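
The map-reduce combination described in this answer can be sketched in a few lines. The following Python is only an illustration and is not part of the original answer: `PartialStats`, `summarize`, and `merge` are hypothetical names, and each chunk stands in for the portion of the data that one worker would handle.

```python
import math
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Running totals for one portion of the data (hypothetical helper)."""
    n: int = 0              # number of points
    total: float = 0.0      # sum of values
    total_sq: float = 0.0   # sum of squared values


def summarize(values):
    """Map step: reduce one chunk to its three running totals."""
    stats = PartialStats()
    for v in values:
        stats.n += 1
        stats.total += v
        stats.total_sq += v * v
    return stats


def merge(a, b):
    """Reduce step: totals from two portions simply add."""
    return PartialStats(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def mean(stats):
    return stats.total / stats.n if stats.n > 0 else 0.0


def std_dev(stats):
    """Sample standard deviation (N-1), same formula as getStdDev()."""
    if stats.n < 2:
        return 0.0
    var = (stats.total_sq - stats.total * stats.total / stats.n) / (stats.n - 1)
    return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


# Three "workers", each seeing only part of the data from the JUnit test.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
combined = PartialStats()
for chunk in chunks:
    combined = merge(combined, summarize(chunk))

print(combined.n, mean(combined), std_dev(combined))
```

Because the three totals only ever add, the order in which partial results are merged does not matter. One caveat: the sum-of-squares formula can lose precision when the mean is large compared with the spread of the data.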



Answer 2:


Not claiming that this is optimal, but in Python (with the numpy and numexpr modules) the following is easy on a machine with 8 GB of RAM:

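import numpy as np
import numexpr

# size must be an integer; a float such as 8e8 is rejected by recent NumPy versions
x = np.random.uniform(0, 1, size=int(8e8))

# Population standard deviation: sqrt(E[x^2] - E[x]^2)
n = len(x)
mean_sq = numexpr.evaluate('sum(x*x)') / n
mean = numexpr.evaluate('sum(x)') / n
print(x.mean(), (mean_sq - mean**2) ** 0.5)
# Output reported in the original answer: 0.499991593345 0.288682001731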

This doesn't consume more memory than the original array.
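
If the data lives on disk and does not fit in memory, the same statistics can be computed by streaming the file in fixed-size chunks. The sketch below is not from the answer. It assumes the file holds raw native-endian 8-byte doubles (the question does not say how the data is stored), uses only the standard library, and swaps in Welford's online algorithm in place of the sum-of-squares formula. `stream_mean_std` and `chunk_values` are hypothetical names.

```python
import math
from array import array


def stream_mean_std(path, chunk_values=1_000_000):
    """Return (count, mean, sample std dev) for a file of raw doubles.

    Assumes native-endian doubles and a file size that is a multiple
    of the item size. Memory use is bounded by chunk_values.
    """
    item_size = array('d').itemsize
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    with open(path, 'rb') as f:
        while True:
            buf = f.read(chunk_values * item_size)
            if not buf:
                break
            chunk = array('d')
            chunk.frombytes(buf)
            for x in chunk:
                # Welford's update for the running mean and deviation sum
                count += 1
                delta = x - mean
                mean += delta / count
                m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std
```

A pure-Python loop over 800 million values will be slow; the point of the sketch is the bounded memory use, and the per-chunk work could be handed to numpy if speed matters.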




Answer 3:


This looks like a nice challenge. Couldn't you build something similar with a tweaked mergesort? Just an idea. It also looks like a dynamic programming problem, and you could use multiple PCs to speed things up.



Source: https://stackoverflow.com/questions/12712756/statistics-wiht-large-amount-of-data-in-c-or-scilab-or-octave-or-r
