Output values differ between R and Python?

日久生厌 2020-12-18 01:38

Perhaps I am doing something wrong while z-normalizing my array. Can someone take a look at this and suggest what's going on?

In R:



        
2 Answers
  • 2020-12-18 02:16

    I believe that your NumPy result is correct. I would do the normalization in a simpler way, though:

    >>> import numpy as np
    >>> data = np.array([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])
    >>> data -= data.mean()
    >>> data /= data.std()
    >>> data
    array([-1.01406602, -0.89253491, -0.63379126,  0.87946705,  1.80075126,
            1.64393692,  1.13429034,  0.54623659,  0.48743122, -0.29664045,
            0.09539539, -0.29664045, -0.93565885, -1.23752644, -1.28065039])
    

    The difference between your two results lies in the normalization. With r holding the R result:

    >>> r / data
    array([ 0.96609173,  0.96609173,  0.96609173,  0.96609179,  0.96609179,
            0.96609181,  0.9660918 ,  0.96609181,  0.96609179,  0.96609179,
            0.9660918 ,  0.96609179,  0.96609175,  0.96609176,  0.96609177])
    

    Thus, your two results are simply proportional to each other, up to rounding. You may therefore want to compare the standard deviations obtained with R and with Python.

    PS: Now that I think of it, the variance may not be defined the same way in NumPy and in R: for N elements, some tools divide by N-1 instead of N when calculating the variance. You may want to check this.

    PPS: Here is the reason for the discrepancy: the two tools use different normalization conventions for the standard deviation. The observed factor is simply sqrt(14/15) = 0.9660917… (because the data has 15 elements). Thus, to obtain the same result as in Python, you need to divide the R result by this factor.
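
    As a short worked derivation of that factor (the symbols s_N, s_{N-1} and z_i are notation introduced here, not from the original post): write s_N for the standard deviation that divides by N, which is NumPy's default, and s_{N-1} for the one that divides by N-1, which is what R's sd() uses. The means cancel, so the ratio of the z-scores is just the ratio of the two standard deviations:

```latex
s_N = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2},
\qquad
s_{N-1} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}

\frac{z_i^{\mathrm{R}}}{z_i^{\mathrm{NumPy}}}
  = \frac{(x_i-\bar{x})/s_{N-1}}{(x_i-\bar{x})/s_N}
  = \frac{s_N}{s_{N-1}}
  = \sqrt{\frac{N-1}{N}}
  = \sqrt{\frac{14}{15}} \approx 0.96609
  \quad (N = 15)
```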

  • 2020-12-18 02:27

    The reason you're getting different results has to do with how the standard deviation/variance is calculated. R calculates it using the denominator N-1, while NumPy by default uses the denominator N. You can get a NumPy result equal to the R result by using data.std(ddof=1), which tells NumPy to use N-1 as the denominator when calculating the variance.
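
    A minimal sketch of this, using the data from the first answer. The helper name z_normalize and its ddof parameter are made up for illustration; only np.std's own ddof argument comes from NumPy:

```python
import numpy as np

data = np.array([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00,
                 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])


def z_normalize(x, ddof=0):
    """Hypothetical helper: subtract the mean, divide by the std.

    ddof=0 divides the variance by N (NumPy's default);
    ddof=1 divides by N-1 (the convention of R's sd()).
    """
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=ddof)


z_population = z_normalize(data, ddof=0)  # NumPy-style result
z_sample = z_normalize(data, ddof=1)      # should match the R-style result

# Every element differs by the same constant factor sqrt((N-1)/N).
ratio = z_sample / z_population
print(ratio)
print(np.sqrt((len(data) - 1) / len(data)))
```

    The two printed values should agree to floating-point precision, which is the constant factor seen in r / data above.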
