Fast incremental update of the mean and covariance in Python

Question

I have a Python script where I need to frequently update the mean and covariance matrix. Currently, each time I get a new data point $x$ (a vector), I recompute the mean and covariance as follows:

data.append(x)  # My `data` is just a list of lists of floats (i.e., x is a list of floats)
self.mean = np.mean(data, axis=0)  # self.mean is a list representing the center of data
self.cov = np.cov(data, rowvar=0)

The problem is that this is not fast enough for me. Is there any way to be more efficient by incrementally updating the mean and covariance without recomputing them from all the data?

Computing the mean incrementally should be easy and I can figure it out. My main problem is how to update the covariance matrix self.cov.

Answer 1:

I'd do it by keeping track of the sum and sum of squares.

In the __init__:

self.sumx = 0
self.sumx2 = 0

And then in the append:

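data.append(x)
x = np.asarray(x, dtype=float)  # x arrives as a list of floats
self.sumx += x
self.sumx2 += x * x[:, np.newaxis]  # outer product of x with itself

n = len(data)
self.mean = self.sumx / n
# Population covariance (divides by n); np.cov divides by n - 1 by default
self.cov = self.sumx2 / n - self.mean * self.mean[:, np.newaxis]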

Note the [:, np.newaxis] broadcasting, which computes the product of every pair of elements.
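The running-sums idea can be sketched as a small self-contained class. This is only an illustration under assumptions: the class name `RunningMoments`, its `dim` argument, and the `ddof` parameter are hypothetical and not part of the answer. With `ddof=1` the result should agree with what `np.cov` returns by default.

```python
import numpy as np

class RunningMoments:
    """Hypothetical helper: keeps the running sum and sum of outer products."""

    def __init__(self, dim):
        self.n = 0
        self.sumx = np.zeros(dim)          # running sum of the points
        self.sumx2 = np.zeros((dim, dim))  # running sum of outer products x x^T

    def append(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.sumx += x
        self.sumx2 += np.outer(x, x)

    @property
    def mean(self):
        return self.sumx / self.n

    def cov(self, ddof=1):
        # sum of (x - mean)(x - mean)^T equals sumx2 - n * mean mean^T
        mean = self.mean
        return (self.sumx2 - self.n * np.outer(mean, mean)) / (self.n - ddof)

# Usage: feed points one at a time, then compare against the batch result
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))
acc = RunningMoments(dim=3)
for p in points:
    acc.append(p)
```

Each `append` costs O(d^2) for a d-dimensional point, independent of how many points have been seen so far.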

Answer 2:

I just figured out that we can easily do that using the mdp library http://mdp-toolkit.sourceforge.net/api/mdp.utils.CovarianceMatrix-class.html

Answer 3:

For the variance (only the diagonal of the covariance matrix) it is simple. You also need to keep the sum of squares of your data. Recall that the formula for variance is: Var(x) = E[x^2] - (E[x])^2. So at each step you update your regular mean and the mean of the squares.

This can be generalized to multivariate data to obtain the full covariance matrix, using Cov(x, y) = E[xy] - E[x]E[y]. Have a look here.
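One caveat with the E[x^2] - (E[x])^2 form is that it subtracts two large, nearly equal numbers when the mean is large relative to the spread, which can lose precision. A commonly used alternative is a Welford-style update, sketched below. It is a different technique from the one this answer describes, and the class name `WelfordCov` and its parameters are hypothetical.

```python
import numpy as np

class WelfordCov:
    """Hypothetical helper: Welford-style online mean and covariance."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros((dim, dim))  # running sum of (x - mean)(x - mean)^T

    def append(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean            # deviation from the old mean
        self.mean += delta / self.n      # update the mean in place
        # old-mean deviation times new-mean deviation
        self.m2 += np.outer(delta, x - self.mean)

    def cov(self, ddof=1):
        return self.m2 / (self.n - ddof)

# Usage: data with a large offset and a small spread
rng = np.random.default_rng(1)
samples = 1e6 + rng.normal(size=(500, 2))
w = WelfordCov(dim=2)
for s in samples:
    w.append(s)
```

With `ddof=1` the result should match the default normalization of `np.cov`.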

Answer 4:

For the mean, you can store the mean of the previous N data points (call it "before_mean"). When a new data point x arrives, the mean of all N+1 points is simply:

new_mean = (before_mean * N + x) / float(N + 1)

so you don't need to revisit the earlier data.
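A minimal sketch of this mean update for vector-valued points, assuming NumPy arrays. The function name `update_mean` is hypothetical, and the formula is written in the algebraically equivalent form before_mean + (x - before_mean) / (N + 1):

```python
import numpy as np

def update_mean(before_mean, n, x):
    """Return the mean of n + 1 points, given the mean of the first n and the new point x."""
    x = np.asarray(x, dtype=float)
    return before_mean + (x - before_mean) / (n + 1)

# Usage: start from a zero vector with n = 0 and feed the points one by one
points = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
mean = np.zeros(2)
for n, p in enumerate(points):
    mean = update_mean(mean, n, p)
# mean now agrees with np.mean(points, axis=0)
```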

For cov, I don't think there is a simple way to do this, and I am not sure about your input data, since cov is normally applied to lists rather than single numbers.

Just out of curiosity: I think you don't need to worry about this if the dataset is not that large, as it's O(N).

Hope it helps~

============ update ===========

import numpy as np
import random

# Each row of `data` is one variable observed 5 times (np.cov treats rows as variables)
data = []
means = []
for m in range(3):
    sample_data = random.sample(range(10), 5)
    means.append(np.mean(sample_data))
    data.append(sample_data)

# calculate the original covariance matrix (np.cov normalizes by n - 1)
origin_cov = np.cov(data)
print(origin_cov)

# new data: a fourth variable x, appended as a new row and column
# NOTE: the new entries below are normalized by n (E[xy] - E[x]E[y], np.var),
# unlike origin_cov, which np.cov normalizes by n - 1
x = random.sample(range(10), 5)
mean_x = np.mean(x)
var_x = np.var(x)
new_line_cov = []
new_cov = np.empty([len(data)+1, len(data)+1])
for idx, sample_data in enumerate(data):
    mul_x_sample = 0
    for (elem_x, elem_sample) in zip(x, sample_data):
        mul_x_sample += (elem_x * elem_sample)
    mul_x_sample = mul_x_sample / len(x)
    cov_x_sample = mul_x_sample - mean_x * means[idx]
    new_cov[idx] = np.append(origin_cov[idx], cov_x_sample)
    new_line_cov.append(cov_x_sample)
new_line_cov.append(var_x)
new_cov[len(data)] = np.array(new_line_cov)

print(new_cov)

The output looks like this:

origin

[[ 9.7   2.7  -4.05]
 [ 2.7   3.7  -3.05]
 [-4.05 -3.05  5.7 ]]

new

[[ 9.7   2.7  -4.05  0.56]
 [ 2.7   3.7  -3.05  1.56]
 [-4.05 -3.05  5.7   0.36]
 [ 0.56  1.56  0.36  8.56]]
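The same extension of the covariance matrix by one new variable can be sketched in vectorized form. This is an illustration under assumptions: the function name `extend_cov` and its `ddof` parameter are hypothetical, and the sketch normalizes every entry by n - ddof, so with `ddof=1` the new row and column use the same convention as `np.cov`.

```python
import numpy as np

def extend_cov(cov, data, x, ddof=1):
    """Append a new variable x (one value per observation) to an existing covariance matrix."""
    data = np.asarray(data, dtype=float)  # shape (k, n): k variables, n observations
    x = np.asarray(x, dtype=float)        # shape (n,)
    k, n = data.shape
    centered = data - data.mean(axis=1, keepdims=True)
    xc = x - x.mean()
    cross = centered @ xc / (n - ddof)    # covariance of x with each existing variable
    var_x = xc @ xc / (n - ddof)          # variance of x
    new_cov = np.empty((k + 1, k + 1))
    new_cov[:k, :k] = cov                 # existing entries are reused unchanged
    new_cov[:k, k] = cross
    new_cov[k, :k] = cross
    new_cov[k, k] = var_x
    return new_cov

# Usage: 3 variables with 5 observations each, then a 4th variable
rng = np.random.default_rng(2)
data = rng.integers(0, 10, size=(3, 5))
x = rng.integers(0, 10, size=5)
new_cov = extend_cov(np.cov(data), data, x)
```

Only the new row and column are computed; the existing k-by-k block is copied as is.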

Source: https://stackoverflow.com/questions/37498612/fast-incremental-update-of-the-mean-and-covariance-in-python

Tags

python

numpy

normal-distribution