Question
I have a Python script where I need to frequently update the mean and covariance matrix. Currently, each time I get a new data point $x$ (a vector), I recompute the mean and covariance as follows:
data.append(x) # My `data` is just a list of lists of floats (i.e., x is a list of floats)
self.mean = np.mean( data, axis=0) # self.mean is a list representing the center of data
self.cov = np.cov( data, rowvar=0)
The problem is that this is not fast enough for me. Is there any way to be more efficient by updating the mean and covariance incrementally, without recomputing them from all the data?
Computing the mean incrementally should be easy and I can figure it out. My main problem is how to update the covariance matrix self.cov.
Answer 1:
I'd do it by keeping track of the sum and the sum of products.
In the __init__:
self.sumx = 0
self.sumx2 = 0
And then in the append:
x = np.asarray(x, dtype=float)  # x arrives as a list of floats
data.append(x)
self.sumx += x
self.sumx2 += x * x[:, np.newaxis]
n = len(data)
self.mean = self.sumx / n
# population covariance (divides by n); multiply by n / (n - 1) to match the np.cov default
self.cov = self.sumx2 / n - self.mean * self.mean[:, np.newaxis]
Note the [:, np.newaxis] broadcasting, which computes the product of every pair of elements.
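For reference, here is a minimal self-contained sketch of the same idea. The class name RunningMoments, its dim argument, and the ddof parameter are made up for illustration and are not part of the answer above; np.outer(x, x) is used in place of the x * x[:, np.newaxis] broadcast, which gives the same matrix.

```python
import numpy as np


class RunningMoments:
    """Running mean and covariance from a sum and a sum of outer products."""

    def __init__(self, dim):
        self.n = 0
        self.sumx = np.zeros(dim)          # running sum of x
        self.sumx2 = np.zeros((dim, dim))  # running sum of outer(x, x)

    def append(self, x):
        x = np.asarray(x, dtype=float)     # accept plain lists of floats
        self.n += 1
        self.sumx += x
        self.sumx2 += np.outer(x, x)

    @property
    def mean(self):
        return self.sumx / self.n

    def cov(self, ddof=1):
        # sum of (x - mean)(x - mean)^T equals sumx2 - n * outer(mean, mean);
        # ddof=1 matches the default normalization of np.cov (divide by n - 1)
        mean = self.mean
        return (self.sumx2 - self.n * np.outer(mean, mean)) / (self.n - ddof)


# Compare against a full recomputation on random data.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
rm = RunningMoments(dim=3)
for row in data:
    rm.append(row)
```

Each append costs O(d^2) for d-dimensional points, independent of how many points have been seen. One caveat: subtracting two large, nearly equal sums can lose precision when the mean is large relative to the spread of the data.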
Answer 2:
I just figured out that we can easily do this using the mdp library: http://mdp-toolkit.sourceforge.net/api/mdp.utils.CovarianceMatrix-class.html
Answer 3:
For the variance (only the diagonal of the covariance matrix) it is simple. You also need to keep the sum of squares of your data. Recall that the formula for variance is: Var(x) = E[x^2] - (E[x])^2. So at each step you compute the regular mean and the mean of the squares.
This can be generalized to multivariate data to obtain a full covariance matrix. Have a look here.
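Written out, the multivariate version of that identity looks roughly like the following. The symbols $n$, $S_n$ and $Q_n$ are introduced here only for illustration, and the normalization is the population one (divide by $n$), matching the variance formula above:

```latex
\operatorname{Cov}(x) = E[x x^{\top}] - E[x]\,E[x]^{\top},
\qquad
S_n = \sum_{i=1}^{n} x_i,
\quad
Q_n = \sum_{i=1}^{n} x_i x_i^{\top},
\quad
\operatorname{Cov}_n = \frac{Q_n}{n} - \frac{S_n S_n^{\top}}{n^{2}}
```

Adding a new point $x_{n+1}$ only requires $S_{n+1} = S_n + x_{n+1}$ and $Q_{n+1} = Q_n + x_{n+1} x_{n+1}^{\top}$, so no earlier data has to be revisited.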
Answer 4:
For the mean, you can store the mean of the first N data points (call it "before_mean"). When a new point x arrives, the mean of the N+1 points is simply:
new_mean = (before_mean * N + x) / float(N + 1)
so you don't need to go back over the earlier data.
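As a quick check of that update, here is a small sketch; the helper name update_mean and the sample points are made up for illustration. It uses the algebraically equivalent form before_mean + (x - before_mean) / (N + 1) and works for vectors as well as scalars.

```python
import numpy as np


def update_mean(before_mean, n, x):
    """Return the mean of n + 1 points, given the mean of the first n and a new point x."""
    before_mean = np.asarray(before_mean, dtype=float)
    x = np.asarray(x, dtype=float)
    # same value as (before_mean * n + x) / (n + 1), rearranged
    return before_mean + (x - before_mean) / (n + 1)


points = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]])
mean = points[0]                       # mean of the first point alone
for n, x in enumerate(points[1:], start=1):
    mean = update_mean(mean, n, x)     # n points seen so far, x is the next one
```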
For the covariance, I don't think there is a simple way to do this, and I am not sure about your input data, since cov is normally applied to lists rather than single numbers.
Out of curiosity: I think you don't need to worry about this if the dataset is not that large, since the recomputation is O(N).
Hope it helps~
============ update ===========
import numpy as np
import random
data = []
means = []
for m in range(3):
    sample_data = random.sample(range(10), 5)
    means.append(np.mean(sample_data))
    data.append(sample_data)
# calculate the original cov (np.cov treats each row as a variable
# and divides by N - 1 by default)
origin_cov = np.cov(data)
print(origin_cov)
# new data: one more row, i.e. one more variable
x = random.sample(range(10), 5)
mean_x = np.mean(x)
var_x = np.var(x)  # divides by N, unlike np.cov above
new_line_cov = []
new_cov = np.empty([len(data) + 1, len(data) + 1])
for idx, sample_data in enumerate(data):
    # E[x * y] - E[x] * E[y], also normalized by N
    mul_x_sample = 0
    for elem_x, elem_sample in zip(x, sample_data):
        mul_x_sample += elem_x * elem_sample
    mul_x_sample = mul_x_sample / len(x)
    cov_x_sample = mul_x_sample - mean_x * means[idx]
    # old row of the covariance matrix plus the new column entry
    new_cov[idx] = np.append(origin_cov[idx], cov_x_sample)
    new_line_cov.append(cov_x_sample)
new_line_cov.append(var_x)
new_cov[len(data)] = np.array(new_line_cov)
print(new_cov)
The output looks like this (the values vary from run to run because the data is random):
origin
[[ 9.7   2.7  -4.05]
 [ 2.7   3.7  -3.05]
 [-4.05 -3.05  5.7 ]]
new
[[ 9.7   2.7  -4.05  0.56]
 [ 2.7   3.7  -3.05  1.56]
 [-4.05 -3.05  5.7   0.36]
 [ 0.56  1.56  0.36  8.56]]
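Note that the update above extends the matrix with a new variable (a new row of data), and that its new entries are normalized by N while np.cov divides by N - 1 by default. A sketch of the same row-append idea with a single normalization throughout, checked against a full recomputation, could look like this; the function name append_variable and the ddof parameter are illustrative only:

```python
import numpy as np


def append_variable(cov, data, x, ddof=1):
    """Extend the covariance matrix of the rows of `data` with a new row `x`."""
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    n_obs = x.size
    # covariance of the new variable with each existing variable
    centered = data - data.mean(axis=1, keepdims=True)
    cross = centered @ (x - x.mean()) / (n_obs - ddof)
    k = cov.shape[0]
    new_cov = np.empty((k + 1, k + 1))
    new_cov[:k, :k] = cov              # old block is unchanged
    new_cov[:k, k] = cross             # new column
    new_cov[k, :k] = cross             # new row (the matrix is symmetric)
    new_cov[k, k] = x.var(ddof=ddof)   # variance of the new variable
    return new_cov


rng = np.random.default_rng(1)
data = rng.integers(0, 10, size=(3, 5))   # 3 variables, 5 observations each
x = rng.integers(0, 10, size=5)           # the new variable
new_cov = append_variable(np.cov(data), data, x)
```

This still needs the existing rows to compute the cross terms, so it saves recomputing only the old block of the matrix.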
Source: https://stackoverflow.com/questions/37498612/fast-incremental-update-of-the-mean-and-covariance-in-python