Can I get data spread (noise) from singular value decomposition?

Question

I was hoping to use singular value decomposition to estimate the standard deviation of ellipsoidal data. I'm not sure if this is the best approach, and I may be overthinking the entire process, so I need some help.

I simulated some data using the following script...

from matplotlib import pyplot as plt
import numpy


def svd_example():
    # simulate some data...
    # x values have standard deviation 3000
    xdata = numpy.random.normal(0, 3000, 5000).reshape(-1, 1)
    # y values standard deviation 300
    ydata = numpy.random.normal(0, 300, 5000).reshape(-1, 1)
    # apply some rotation
    ydata_rotated = ydata + (xdata * 0.5)
    data = numpy.hstack((xdata, ydata_rotated))

    # get singular values
    left_singular_matrix, singular_values, right_singular_matrix = numpy.linalg.svd(data)
    print('singular values', singular_values)

    # plot data....
    plt.scatter(data[:, 0], data[:, 1], s=5)
    plt.ylim(-15000, 15000)
    plt.show()

svd_example()

I get singular values of...

>>> singular values [ 234001.71228678   18850.45155942]

My data looks like this...

I was under the assumption that the singular values would give me some indication of the spread of the data regardless of its rotation, right? But these values, [234001.71228678 18850.45155942], make no sense to me. My standard deviations were 3000 and 300. Do these singular values represent variance? How do I convert them?

Answer 1:

The singular values do give some indication of the spread. In fact, they are related to the standard deviations along the principal directions of the data. However, they are not normalized. If you divide them by the square root of the number of samples, you get values that closely resemble the standard deviations used to create the data:

singular_values / numpy.sqrt(5000)
# array([ 3398.61320614,   264.00975837])

Why do you get 3400 and 264 instead of 3000 and 300? That is because ydata + (xdata * 0.5) is not a rotation but a shearing operation. A real rotation would preserve the original standard deviations.
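The effect of the shear can also be checked without random data. The following is a minimal sketch, assuming the exact standard deviations 3000 and 300 from the question and treating the operation ydata + (xdata * 0.5) as a matrix acting on row vectors. The names shear, cov and expected_spread are made up for this illustration.

```python
import numpy

# Shear from the question, y' = y + 0.5 * x, as a matrix for row-vector data:
# [x, y].dot(shear) == [x, 0.5 * x + y]
shear = numpy.array([[1.0, 0.5],
                     [0.0, 1.0]])

# Covariance of the original, independent x and y values.
cov = numpy.diag([3000.0 ** 2, 300.0 ** 2])

# Covariance after the shear: for d' = d.dot(shear), cov' = shear.T cov shear.
cov_sheared = shear.T.dot(cov).dot(shear)

# Square roots of the eigenvalues are the standard deviations along the
# principal axes; eigvalsh returns them in ascending order, so reverse.
expected_spread = numpy.sqrt(numpy.linalg.eigvalsh(cov_sheared))[::-1]
print(expected_spread)
```

This gives roughly 3357 and 268, which is the neighbourhood of the normalized singular values above. The remaining difference comes from the random sampling.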

For example, the following code would rotate the data by 40 degrees:

# apply some rotation
s = numpy.sin(40 * numpy.pi / 180)
c = numpy.cos(40 * numpy.pi / 180)
data = numpy.hstack((xdata, ydata)).dot([[c, s], [-s, c]])

With such a rotation you will get normalized singular values that are pretty close to the original standard deviations.
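To check this end to end, here is a small self-contained sketch that combines the simulation from the question with the rotation. The fixed seed and the variable names rng, n, rotation and spread are assumptions added for repeatability and readability; they are not part of the original script.

```python
import numpy

rng = numpy.random.RandomState(0)  # fixed seed, only so the run is repeatable
n = 5000

# Same simulated data as in the question.
xdata = rng.normal(0, 3000, n).reshape(-1, 1)
ydata = rng.normal(0, 300, n).reshape(-1, 1)

# A true rotation by 40 degrees, applied to row vectors.
angle = numpy.deg2rad(40)
c, s = numpy.cos(angle), numpy.sin(angle)
rotation = numpy.array([[c, s], [-s, c]])
data = numpy.hstack((xdata, ydata)).dot(rotation)

# Only the singular values are needed, so skip computing U and V.
singular_values = numpy.linalg.svd(data, compute_uv=False)

# Divide by sqrt(n) to get values comparable to standard deviations.
spread = singular_values / numpy.sqrt(n)
print(spread)
```

The printed values should land near 3000 and 300, up to sampling error. Passing compute_uv=False also avoids building the 5000 x 5000 left singular matrix, which is not needed here.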

Edit: On Normalization

I have to admit that normalization is probably not the correct term to apply here. I did not mean scaling the values to a fixed range. What I meant was bringing the values onto a defined scale that is independent of the number of samples.

To understand where the division by sqrt(5000) comes from, let's talk about the standard deviation. Let x be a data vector of n samples with zero mean. Then the standard deviation is computed as sqrt(sum(x**2)/n), or equivalently sqrt(sum(x**2)) / sqrt(n). Now, you can think of the singular value decomposition as computing only the sqrt(sum(x**2)) part, so we have to divide by sqrt(n) ourselves.
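For a single column this relationship holds exactly, which can be checked numerically. The sketch below assumes a zero-mean vector (the sample mean is subtracted explicitly) and uses illustrative names sv, norm and std.

```python
import numpy

rng = numpy.random.RandomState(1)  # fixed seed, only so the run is repeatable
x = rng.normal(0, 3000, 5000)
x = x - x.mean()  # enforce the zero-mean assumption
n = x.size

# The only singular value of a one-column matrix is the length of that column.
sv = numpy.linalg.svd(x.reshape(-1, 1), compute_uv=False)[0]
norm = numpy.sqrt(numpy.sum(x ** 2))

# numpy.std uses the population formula sqrt(sum(x**2) / n) by default.
std = numpy.std(x)

print(sv, norm)              # these two agree
print(sv / numpy.sqrt(n), std)  # and so do these two
```

With more than one column, the singular values play the same role along the principal directions of the data rather than along the original axes.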

I'm afraid this is not a very mathematical explanation, but hopefully it conveys the idea.

Source: https://stackoverflow.com/questions/36154950/can-i-get-data-spread-noise-from-singular-value-decomposition

Tags

python

numpy

linear-algebra

svd