How do I calculate r-squared using Python and Numpy?

春和景丽 asked 2020-11-28 19:29

I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.). How do I calculate the r-squared of the resulting fit?

11 answers
  • 2020-11-28 20:04

    R-squared is a statistic that only applies to linear regression.

    Essentially, it measures how much variation in your data can be explained by the linear regression.

    So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean:

    \sum_{i}(y_{i} - \bar{y})^2

    where \bar{y} is the mean of the y's.

    Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean

    \sum_{i}(\hat{y}_{i} - \bar{y})^2

    and find the ratio of those two.

    Now, all you would have to do for a polynomial fit is plug in the \hat{y}_{i}'s from that model, but it's not accurate to call that r-squared.
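
    As a rough numpy sketch of that calculation (the data and the degree here are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

    coeffs = np.polyfit(x, y, 2)           # fit a degree-2 polynomial
    y_hat = np.polyval(coeffs, x)          # fitted values
    y_bar = y.mean()

    ss_reg = np.sum((y_hat - y_bar) ** 2)  # "regression sum of squares"
    ss_tot = np.sum((y - y_bar) ** 2)      # "total sum of squares"
    print(ss_reg / ss_tot)                 # the ratio described above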

    Here is a link I found that speaks to it a little.

  • 2020-11-28 20:06

    Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):

    from __future__ import division
    import numpy as np

    def compute_r2_weighted(y_true, y_pred, weight):
        # weighted sum of squared errors (SSE)
        sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        # weighted total sum of squares (SST) around the weighted mean
        tse = (weight * (y_true - np.average(
            y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse
    

    Example:

    from __future__ import print_function, division
    import numpy as np
    import sklearn.metrics

    # compute_r2_weighted() is the function defined above;
    # compute_r2() is an unweighted variant for comparison:
    
    def compute_r2(y_true, y_predicted):
        sse = sum((y_true - y_predicted) ** 2)
        # (n - 1) * sample variance equals the total sum of squares about the mean
        tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse
    
    def main():
        '''
        Demonstrate compute_r2_weighted() and compute_r2() and check the results against sklearn
        '''        
        y_true = [3, -0.5, 2, 7]
        y_pred = [2.5, 0.0, 2, 8]
        weight = [1, 5, 1, 2]
        r2_score = sklearn.metrics.r2_score(y_true, y_pred)
        print('r2_score: {0}'.format(r2_score))  
        r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
        print('r2_score: {0}'.format(r2_score))
        r2_score = sklearn.metrics.r2_score(y_true, y_pred, sample_weight=weight)
        print('r2_score weighted: {0}'.format(r2_score))
        r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
        print('r2_score weighted: {0}'.format(r2_score))
    
    if __name__ == "__main__":
        main()
        #cProfile.run('main()') # if you want to do some profiling
    

    outputs:

    r2_score: 0.9486081370449679
    r2_score: 0.9486081370449679
    r2_score weighted: 0.9573170731707317
    r2_score weighted: 0.9573170731707317
    

    This corresponds to the formula:

    R^2 = 1 - SSE / SST

    SSE = \sum_{i} w_{i} (y_{i} - f_{i})^2
    SST = \sum_{i} w_{i} (y_{i} - y_{av})^2

    where f_i is the predicted value from the fit, y_{av} is the weighted mean of the observed data, and y_i is the observed data value; w_i is the weight applied to each data point, usually w_i = 1. SSE is the sum of squares due to error and SST is the total sum of squares.


    If you're interested, equivalent code in R is here: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f

  • 2020-11-28 20:08

    From the scipy.stats.linregress source: it uses the average-sum-of-squares method.

    import numpy as np

    # x and y are the input data sequences
    x = np.array(x)
    y = np.array(y)

    # average sums of squares via the covariance matrix:
    ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat

    r_num = ssxym
    r_den = np.sqrt(ssxm * ssym)

    if r_den == 0.0:
        r = 0.0
    else:
        r = r_num / r_den
        # clamp floating-point overshoot
        if r > 1.0:
            r = 1.0
        elif r < -1.0:
            r = -1.0

    r_squared = r ** 2
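
    As a quick sanity check with made-up data, the same r can be read off np.corrcoef:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    print(r ** 2)                # should match r_squared from the snippet above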
    
  • 2020-11-28 20:12

    A very late reply, but just in case someone needs a ready function for this:

    scipy.stats.linregress

    i.e.

    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    

    as in @Adam Marples's answer.
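
    For instance, a minimal sketch with made-up data:

    from scipy.stats import linregress

    x = [1, 2, 3, 4, 5]
    y = [2.0, 4.1, 5.9, 8.2, 9.8]

    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    print(r_value ** 2)  # r-squared of the linear fit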

  • 2020-11-28 20:16

    You can execute this code directly; it will find the polynomial fit and the R-value for you. Leave a comment below if you need more explanation.

    from scipy.stats import linregress
    import numpy as np
    
    x = np.array([1,2,3,4,5,6])
    y = np.array([2,3,5,6,7,8])
    
    p3 = np.polyfit(x, y, 3)  # 3rd-degree polynomial; change the degree as needed
    xp = np.linspace(1, 6, 6) # 6 evenly spaced sample points from 1 to 6
    poly_arr = np.polyval(p3,xp)
    
    poly_list = [round(num, 3) for num in list(poly_arr)]
    slope, intercept, r_value, p_value, std_err = linregress(x, poly_list)
    print(r_value**2)
    
  • 2020-11-28 20:18

    From the numpy.polyfit documentation, it is fitting a linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

    E(y|x) = p_d * x**d + p_{d-1} * x**(d-1) + ... + p_1 * x + p_0

    So you just need to calculate the R-squared for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

    SST = Sum(i=1..n) (y_i - y_bar)^2
    SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
    Rsquared = SSReg/SST
    

    where I use 'y_bar' for the mean of the y's, and 'y_ihat' for the fitted value at each point.

    I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct

    import numpy
    
    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}
    
        coeffs = numpy.polyfit(x, y, degree)
    
        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()
    
        # r-squared
        p = numpy.poly1d(coeffs)
        # fit values, and mean
        yhat = p(x)                         # or [p(z) for z in x]
        ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
        ssreg = numpy.sum((yhat-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
        sstot = numpy.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
        results['determination'] = ssreg / sstot
    
        return results
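
    For example, a quick call with made-up data:

    results = polyfit([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1], 1)
    print(results['polynomial'])     # [slope, intercept] for degree 1
    print(results['determination'])  # the r-squared value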
    