I am running a regression as follows (df is a pandas dataframe):
import statsmodels.api as sm
est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
est.summary()
This gave me, among other things, an R-squared of 0.942. I then wanted to plot the original y-values against the fitted values. For this, I sorted the original values:
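import numpy as np
import matplotlib.pyplot as plt
orig = df['p'].values
fitted = est.fittedvalues.values
args = np.argsort(orig)  # indices that sort the observed values
plt.plot(orig[args], 'bo')    # observed values in blue
plt.plot(fitted[args], 'ro')  # fitted values (orig - resid) in red
plt.show()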
This, however, gave me a graph where the fitted values were completely off, nothing that would suggest an R-squared above 0.9. I therefore tried to calculate it by hand:
yBar = df['p'].mean()
SSTot = ((df['p'] - yBar)**2).sum()
SSReg = ((est.fittedvalues - yBar)**2).sum()
1 - SSReg/SSTot
Out[79]: 0.2618159806908984
Am I doing something wrong? Or is there a reason why my computation is so far off from what statsmodels is getting? SSTot and SSReg have values of 48084 and 35495, respectively.
If you do not include an intercept (a constant explanatory variable) in your model, statsmodels computes R-squared from the un-centred total sum of squares, i.e.
tss = (ys ** 2).sum() # un-centred total sum of squares
as opposed to
tss = ((ys - ys.mean())**2).sum() # centred total sum of squares
As a result, R-squared comes out higher, and much higher when the mean of ys is far from zero.
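To make the difference concrete, here is a minimal sketch using NumPy on synthetic data. The data, the seed and the variable names are made up for illustration and are not taken from the question. It fits a least-squares model without an intercept and computes R-squared against both totals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic data (an assumption for illustration): ys has a large mean that
# a model without a constant term cannot absorb directly.
xs = rng.normal(loc=5.0, scale=1.0, size=(n, 1))
ys = 10.0 + 0.5 * xs[:, 0] + rng.normal(scale=1.0, size=n)

# Least-squares fit WITHOUT an intercept column.
beta, *_ = np.linalg.lstsq(xs, ys, rcond=None)
resid = ys - xs @ beta
ssr = (resid ** 2).sum()  # residual sum of squares

tss_uncentred = (ys ** 2).sum()
tss_centred = ((ys - ys.mean()) ** 2).sum()

r2_uncentred = 1 - ssr / tss_uncentred  # definition used when no constant is present
r2_centred = 1 - ssr / tss_centred      # definition used when a constant is present

print("un-centred R-squared:", r2_uncentred)
print("centred R-squared:   ", r2_centred)
```

Because the un-centred total is never smaller than the centred one, the first number can never be below the second. The centred version can even turn negative for a model without an intercept.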
Using the un-centred sum here is mathematically correct, because R-squared should indicate how much of the variation is explained by the full model compared to the reduced model. If you define your model as:
ys = beta1 . xs + beta0 + noise
then the reduced model is ys = beta0 + noise, where the estimate of beta0 is the sample mean, so noise = ys - ys.mean(). That is where the de-meaning comes from in a model with an intercept.
But from a model like:
ys = beta . xs + noise
you can only reduce to ys = noise. Since the noise is assumed to have zero mean, you may not de-mean ys. The unexplained variation in the reduced model is therefore the un-centred total sum of squares.
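The argument above can also be checked numerically. The following is a rough sketch on synthetic data (all names and numbers are assumptions for illustration): for a fit without an intercept, the un-centred total splits exactly into explained plus residual sums of squares, whereas the centred split used in the question generally does not hold.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Two synthetic regressors and no constant column (illustrative data only).
xs = rng.normal(loc=2.0, size=(n, 2))
ys = 3.0 + xs @ np.array([1.0, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(xs, ys, rcond=None)
fitted = xs @ beta
resid = ys - fitted

# Reduced model "ys = noise": the total is the un-centred sum of squares,
# and it decomposes exactly because resid is orthogonal to fitted.
uncentred_gap = (ys ** 2).sum() - ((fitted ** 2).sum() + (resid ** 2).sum())

# The centred decomposition relies on the residuals summing to zero,
# which is only guaranteed when a constant is part of the model.
y_bar = ys.mean()
centred_gap = ((ys - y_bar) ** 2).sum() - (
    ((fitted - y_bar) ** 2).sum() + (resid ** 2).sum()
)

print("un-centred gap:", uncentred_gap)  # zero up to rounding error
print("centred gap:   ", centred_gap)    # generally not zero
```

This is why a hand computation based on de-meaned sums of squares need not agree with the reported R-squared when the model has no constant.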
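This is documented in the statsmodels documentation under the rsquared item. Set yBar equal to zero and use SSReg/SSTot (or, equivalently, 1 - SSRes/SSTot with SSRes the residual sum of squares) rather than 1 - SSReg/SSTot, and I would expect you to get the same number.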
Source: https://stackoverflow.com/questions/24851787/statsmodels-calculate-fitted-values-and-r-squared