问题
I do this linear regression
with StatsModels
:
import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
n = 100
x = np.linspace(0, 10, n)
e = np.random.normal(size=n)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)
re = sm.OLS(y, X).fit()
print(re.summary())
prstd, iv_l, iv_u = wls_prediction_std(re)
My questions are, iv_l
and iv_u
are the upper and lower confidence intervals or prediction intervals?
How I get others?
I need the confidence and prediction intervals for all points, to do a plot.
回答1:
update see the second answer which is more recent. Some of the models and results classes have now a get_prediction
method that provides additional information including prediction intervals and/or confidence intervals for the predicted mean.
old answer:
iv_l
and iv_u
give you the limits of the prediction interval for each point.
Prediction interval is the confidence interval for an observation and includes the estimate of the error.
I think, confidence interval for the mean prediction is not yet available in statsmodels
.
(Actually, the confidence interval for the fitted values is hiding inside the summary_table of influence_outlier, but I need to verify this.)
Proper prediction methods for statsmodels are on the TODO list.
Addition
Confidence intervals are there for OLS but the access is a bit clumsy.
To be included after running your script:
from statsmodels.stats.outliers_influence import summary_table
st, data, ss2 = summary_table(re, alpha=0.05)
fittedvalues = data[:, 2]
predict_mean_se = data[:, 3]
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
predict_ci_low, predict_ci_upp = data[:, 6:8].T
# Check we got the right things
print np.max(np.abs(re.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
plt.show()

This should give the same results as SAS, http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html
回答2:
For test data you can try to use the following.
predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)
I found the summary_frame() method buried here and you can find the get_prediction() method here. You can change the significance level of the confidence interval and prediction interval by modifying the "alpha" parameter.
I am posting this here because this was the first post that comes up when looking for a solution for confidence & prediction intervals – even though this concerns itself with test data rather.
Here's a function to take a model, new data, and an arbitrary quantile, using this approach:
def ols_quantile(m, X, q):
# m: OLS model.
# X: X matrix.
# q: Quantile.
#
# Set alpha based on q.
a = q * 2
if q > 0.5:
a = 2 * (1 - q)
predictions = m.get_prediction(X)
frame = predictions.summary_frame(alpha=a)
if q > 0.5:
return frame.obs_ci_upper
return frame.obs_ci_lower
回答3:
You can get the prediction intervals by using LRPI() class from the Ipython notebook in my repo (https://github.com/shahejokarian/regression-prediction-interval).
You need to set the t value to get the desired confidence interval for the prediction values, otherwise the default is 95% conf. interval.
The LRPI class uses sklearn.linear_model's LinearRegression , numpy and pandas libraries.
There is an example shown in the notebook too.
回答4:
summary_frame
and summary_table
work well when you need exact results for a single quantile, but don't vectorize well. This will provide a normal approximation of the prediction interval (not confidence interval) and works for a vector of quantiles:
def ols_quantile(m, X, q):
# m: Statsmodels OLS model.
# X: X matrix of data to predict.
# q: Quantile.
#
from scipy.stats import norm
mean_pred = m.predict(X)
se = np.sqrt(m.scale)
return mean_pred + norm.ppf(q) * se
回答5:
You can calculate them based on results given by statsmodel and the normality assumptions.
Here is an example for OLS and CI for the mean value:
import statsmodels.api as sm
import numpy as np
from scipy import stats
#Significance level:
sl = 0.05
#Evaluate mean value at a required point x0. Here, at the point (0.0,2.0) for N_model=2:
x0 = np.asarray([1.0, 0.0, 2.0])# If you have no constant in your model, remove the first 1.0. For more dimensions, add the desired values.
#Get an OLS model based on output y and the prepared vector X (as in your notation):
model = sm.OLS(endog = y, exog = X )
results = model.fit()
#Get two-tailed t-values:
(t_minus, t_plus) = stats.t.interval(alpha = (1.0 - sl), df = len(results.resid) - len(x0) )
y_value_at_x0 = np.dot(results.params, x0)
lower_bound = y_value_at_x0 + t_minus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
upper_bound = y_value_at_x0 + t_plus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
You can wrap a nice function around this with input results, point x0 and significance level sl.
I am unsure now if you can use this for WLS() since there are extra things happening there.
Ref: Ch3 in [D.C. Montgomery and E.A. Peck. “Introduction to Linear Regression Analysis.” 4th. Ed., Wiley, 1992].
来源:https://stackoverflow.com/questions/17559408/confidence-and-prediction-intervals-with-statsmodels