How do I calculate r-squared using Python and Numpy?

后端 未结 11 1880
春和景丽
春和景丽 2020-11-28 19:29

I\'m using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (lin

相关标签:
11条回答
  • 2020-11-28 20:19

    From yanl (yet-another-library) sklearn.metrics has an r2_score function;

    from sklearn.metrics import r2_score
    
    coefficient_of_dermination = r2_score(y, p(x))
    
    0 讨论(0)
  • 2020-11-28 20:23

    I have been using this successfully, where x and y are array-like.

    def rsquared(x, y):
        """ Return R^2 where x and y are array-like."""
    
        slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
        return r_value**2
    
    0 讨论(0)
  • 2020-11-28 20:24

    Here's a very simple python function to compute R^2 from the actual and predicted values assuming y and y_hat are pandas series:

    def r_squared(y, y_hat):
        y_bar = y.mean()
        ss_tot = ((y-y_bar)**2).sum()
        ss_res = ((y-y_hat)**2).sum()
        return 1 - (ss_res/ss_tot)
    
    0 讨论(0)
  • 2020-11-28 20:25

    The wikipedia article on r-squareds suggests that it may be used for general model fitting rather than just linear regression.

    0 讨论(0)
  • 2020-11-28 20:26

    I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.


    statsmodels has the capability to calculate the r^2 of a polynomial fit directly, here are 2 methods...

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    
    # Construct the columns for the different powers of x
    def get_r2_statsmodels(x, y, k=1):
        xpoly = np.column_stack([x**i for i in range(k+1)])    
        return sm.OLS(y, xpoly).fit().rsquared
    
    # Use the formula API and construct a formula describing the polynomial
    def get_r2_statsmodels_formula(x, y, k=1):
        formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
        data = {'x': x, 'y': y}
        return smf.ols(formula, data).fit().rsquared # or rsquared_adj
    

    To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

    model = sm.OLS(y, xpoly)
    results = model.fit()
    results.summary()
    

    Below is my original Answer where I benchmarked various linear regression r^2 methods...

    The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2
    

    These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:

    • Pure Python (direct r calculation)
      • 1000 loops, best of 3: 1.59 ms per loop
    • Numpy polyfit (applicable to n-th degree polynomial fits)
      • 1000 loops, best of 3: 326 µs per loop
    • Numpy Manual (direct r calculation)
      • 10000 loops, best of 3: 62.1 µs per loop
    • Numpy corrcoef (direct r calculation)
      • 10000 loops, best of 3: 56.6 µs per loop
    • Scipy (linear regression with r as an output)
      • 1000 loops, best of 3: 676 µs per loop
    • Statsmodels (can do n-th degree polynomial and many other fits)
      • 1000 loops, best of 3: 422 µs per loop

    The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than the scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

    Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    import math
    
    n=1000
    x = np.random.rand(1000)*10
    x.sort()
    y = 10 * x + (5+np.random.randn(1000)*10-5)
    
    x_list = list(x)
    y_list = list(y)
    
    def get_r2_numpy(x, y):
        slope, intercept = np.polyfit(x, y, 1)
        r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
        return r_squared
        
    def get_r2_scipy(x, y):
        _, _, r_value, _, _ = stats.linregress(x, y)
        return r_value**2
        
    def get_r2_statsmodels(x, y):
        return sm.OLS(y, sm.add_constant(x)).fit().rsquared
        
    def get_r2_python(x_list, y_list):
        n = len(x_list)
        x_bar = sum(x_list)/n
        y_bar = sum(y_list)/n
        x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
        y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
        zx = [(xi-x_bar)/x_std for xi in x_list]
        zy = [(yi-y_bar)/y_std for yi in y_list]
        r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
        return r**2
        
    def get_r2_numpy_manual(x, y):
        zx = (x-np.mean(x))/np.std(x, ddof=1)
        zy = (y-np.mean(y))/np.std(y, ddof=1)
        r = np.sum(zx*zy)/(len(x)-1)
        return r**2
        
    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2
        
    print('Python')
    %timeit get_r2_python(x_list, y_list)
    print('Numpy polyfit')
    %timeit get_r2_numpy(x, y)
    print('Numpy Manual')
    %timeit get_r2_numpy_manual(x, y)
    print('Numpy corrcoef')
    %timeit get_r2_numpy_corrcoef(x, y)
    print('Scipy')
    %timeit get_r2_scipy(x, y)
    print('Statsmodels')
    %timeit get_r2_statsmodels(x, y)
    
    0 讨论(0)
提交回复
热议问题