OLS using statsmodel.formula.api versus statsmodel.api

我的梦境 提交于 2019-12-03 00:19:39

The difference is due to the presence of intercept or not:

  • in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept in fitted
  • in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api

    x1 = sm.add_constant(x1)
    

Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.

Unless you are using actual R-style string-formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formulas.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.

Furthermore it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to

[Indicate] whether the RHS includes a user-supplied constant

because, in the footnotes

No constant is added by the model unless you are using formulas.

The example below shows that both .formulas.api and .api will require a user-added column vector of 1s if not using R-style string formulas.

# Generate some relational data
np.random.seed(123)
nobs = 25 
x = np.random.random((nobs, 2)) 
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1] 
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e

Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:

Intercept       1.497761024
X Variable 1    0.012073045
X Variable 2    0.623936056

Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)

import statsmodels.formula.api as smf
import statsmodels.api as sm

print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)

# smf models
# ----------
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]

Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.

print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)

# sm models
# ----------
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!