statsmodels

Capturing high multi-collinearity in statsmodels

Say I fit a model in statsmodels:

    mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()

When I call mod.summary() I may see the following:

    Warnings:
    [1] The condition number is large, 1.59e+05. This might indicate that there are
    strong multicollinearity or other numerical problems.

Sometimes the warning is different (e.g. based on the eigenvalues of the design matrix). How can I capture high-multicollinearity conditions in a variable? Is this warning stored somewhere in the model object? Also, where can I find a description of the fields in summary()? You can detect…
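The excerpt cuts off, but the condition number behind that warning can be read off the fitted results. A minimal sketch, assuming the mod fit above and a recent statsmodels version that exposes condition_number on the results object:

    import numpy as np

    # The condition number that summary() warns about, as a plain float
    cond_no = mod.condition_number

    # Equivalently, derive it from the eigenvalues of X'X, which is how
    # statsmodels computes it internally
    eigvals = np.linalg.eigvalsh(mod.model.exog.T @ mod.model.exog)
    cond_no_manual = np.sqrt(eigvals.max() / eigvals.min())

    # 1000 is a common rule-of-thumb cutoff, not a statsmodels default
    is_collinear = cond_no > 1000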

Plotting confidence and prediction intervals with repeated entries

Question: I have a correlation plot for two variables: the predictor variable (temperature) on the x-axis and the response variable (density) on the y-axis. My best-fit least-squares regression line is a 2nd-order polynomial. I would like to also plot confidence and prediction intervals, and the method described in this answer seems perfect. However, my dataset (n = 2340) has repeated entries for many (x, y) pairs. My resulting plot looks like this: [plot omitted]. Here is my relevant code (slightly modified from the linked…
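With many repeated x-values, a clean way to get smooth bands is to evaluate the fitted model on a sorted grid rather than on the raw observations. A sketch along those lines, assuming a DataFrame df with columns temperature and density:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    # 2nd-order polynomial fit, matching the question's setup
    mod = smf.ols('density ~ temperature + I(temperature**2)', data=df).fit()

    # Predict on an evenly spaced grid so the bands plot as smooth curves
    grid = pd.DataFrame({'temperature': np.linspace(df['temperature'].min(),
                                                    df['temperature'].max(), 200)})
    pred = mod.get_prediction(grid).summary_frame(alpha=0.05)

    plt.scatter(df['temperature'], df['density'], s=5, alpha=0.3)
    plt.plot(grid['temperature'], pred['mean'])
    # Confidence interval for the mean response
    plt.fill_between(grid['temperature'], pred['mean_ci_lower'],
                     pred['mean_ci_upper'], alpha=0.4)
    # Wider prediction interval for new observations
    plt.fill_between(grid['temperature'], pred['obs_ci_lower'],
                     pred['obs_ci_upper'], alpha=0.15)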

Statsmodels: Calculate fitted values and R squared

Question: I am running a regression as follows (df is a pandas DataFrame):

    import statsmodels.api as sm
    est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
    est.summary()

This gave me, among other things, an R-squared of 0.942. I then wanted to plot the original y-values together with the fitted values. For this, I sorted the original values:

    orig = df['p'].values
    fitted = est.fittedvalues.values
    args = np.argsort(orig)
    import matplotlib.pyplot as plt
    plt.plot(orig[args], 'bo')
    plt.plot(orig…
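The snippet breaks off mid-plot; presumably the second series should be the fitted values in the same order. A minimal completion under that assumption:

    import numpy as np
    import matplotlib.pyplot as plt

    orig = df['p'].values
    fitted = est.fittedvalues.values
    r2 = est.rsquared  # the 0.942 from summary(), available programmatically

    # Apply one ordering to both series so each fitted point sits above
    # its observed counterpart
    args = np.argsort(orig)
    plt.plot(orig[args], 'bo', label='observed')
    plt.plot(fitted[args], 'ro', label='fitted')
    plt.legend()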

scikit-learn & statsmodels - which R-squared is correct?

I'd like to choose the best algorithm for the future. I found some solutions, but I didn't understand which R-squared value is correct. I split my data into test and training sets and printed two different R-squared values with the code below:

    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    lineer = LinearRegression()
    lineer.fit(x_train, y_train)
    lineerPredict = lineer.predict(x_test)

    scoreLineer = r2_score(y_test, lineerPredict)  # First R-Squared

    model = sm.OLS(lineerPredict, y_test)
    print(model.fit().summary())  # Second R-Squared
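Note that the second R-squared comes from regressing the predictions on y_test, which answers a different question than the first. A like-for-like comparison would score both libraries' models on the same held-out data; a sketch, reusing the question's variable names:

    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    sk_model = LinearRegression().fit(x_train, y_train)
    score_sklearn = r2_score(y_test, sk_model.predict(x_test))

    # statsmodels fitted on the same training data; add_constant supplies
    # the intercept that sklearn includes implicitly
    sm_model = sm.OLS(y_train, sm.add_constant(x_train)).fit()
    score_sm = r2_score(y_test, sm_model.predict(sm.add_constant(x_test)))
    # score_sklearn and score_sm should now agree closely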

Pandas rolling regression: alternatives to looping

Question: I got good use out of pandas' MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was gutted completely in pandas 0.20. The question of how to run a rolling OLS regression efficiently has been asked several times (here, for instance), but it was phrased a little too broadly and, in my view, left without a great answer. Here are my questions: How can I best mimic the basic framework of pandas' MovingOLS? The most attractive feature of this class was the…
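For what it's worth, statsmodels has since grown a vectorized replacement for this use case. A sketch, assuming statsmodels >= 0.11 and a Series y with index-aligned regressors X:

    import statsmodels.api as sm
    from statsmodels.regression.rolling import RollingOLS  # statsmodels >= 0.11

    # One fit per window; .params holds a row of coefficients per window end-point
    rols = RollingOLS(y, sm.add_constant(X), window=60).fit()
    coefs = rols.params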

Fixed effect in Pandas or Statsmodels

Is there an existing function in pandas or statsmodels to estimate fixed effects (one-way or two-way)? There used to be a function in statsmodels, but it seems to have been discontinued. And in pandas there is something called plm, but I can't import it or run it using pd.plm(). As noted in the comments, PanelOLS has been removed from pandas as of version 0.20.0, so you really have three options (see the sketch below): if you use Python 3, you can use linearmodels as specified in the more recent answer (https://stackoverflow.com/a/44836199/3435183); or just specify various dummies in your statsmodels specification, e.g. using pd…
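A sketch of the dummy-variable route, with the linearmodels alternative commented alongside it, assuming a DataFrame df with outcome y, regressor x, and an entity column firm:

    import statsmodels.formula.api as smf

    # One-way fixed effects via entity dummies (least squares dummy variables);
    # C() tells the formula interface to expand firm into indicator columns
    fe_mod = smf.ols('y ~ x + C(firm)', data=df).fit()

    # Alternatively, with the linearmodels package on an (entity, time) MultiIndex:
    # from linearmodels.panel import PanelOLS
    # fe = PanelOLS.from_formula('y ~ x + EntityEffects', data=panel_df).fit()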

Run an OLS regression with Pandas Data Frame

I have a pandas DataFrame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

    import pandas as pd
    df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                       "B": [20, 30, 10, 40, 50],
                       "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, they appear to feed the data to the model as a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.
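The statsmodels formula interface provides exactly this R-style syntax, so no reshaping is needed. A sketch using the toy frame above:

    import statsmodels.formula.api as smf

    # Regress A on B and C straight from the DataFrame columns
    result = smf.ols('A ~ B + C', data=df).fit()
    print(result.params)    # Intercept, B, C
    print(result.summary())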

What statistics module for Python supports one-way ANOVA with post-hoc tests (Tukey, Scheffé, or other)?

Question: I have tried looking through multiple statistics modules for Python but can't seem to find any that support one-way ANOVA post-hoc tests.

Answer 1: A one-way ANOVA can be run like this:

    from scipy import stats
    f_value, p_value = stats.f_oneway(data1, data2, data3, data4, ...)

This is a one-way ANOVA; it returns the F value and the p value. There is a significant difference if the p value is below your chosen threshold. The Tukey-Kramer HSD test can be used like

    from statsmodels.stats.multicomp import pairwise_tukeyhsd…
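The excerpt stops at the import. pairwise_tukeyhsd wants one flat array of observations plus a parallel array of group labels; a minimal sketch with three hypothetical sample arrays data1, data2, data3:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    f_value, p_value = stats.f_oneway(data1, data2, data3)

    # Stack the samples and label each observation with its group
    values = np.concatenate([data1, data2, data3])
    groups = ['g1'] * len(data1) + ['g2'] * len(data2) + ['g3'] * len(data3)
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))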

OLS Regression: Scikit vs. Statsmodels?

Question: Short version: I was using scikit-learn's LinearRegression on some data, but I'm used to p-values, so I put the same data into statsmodels' OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (since I have likely built one model incorrectly but don't know which one). Longer version: Because I don't know where the…
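A common cause of matching R^2 but wildly different coefficients is the intercept: sklearn fits one by default, while sm.OLS does not unless a constant column is added explicitly. A sketch of an aligned comparison, with X and y as assumed names:

    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    sk = LinearRegression().fit(X, y)             # intercept fitted implicitly
    sm_fit = sm.OLS(y, sm.add_constant(X)).fit()  # constant added explicitly

    print(sk.intercept_, sk.coef_)
    print(sm_fit.params)  # should now match up to numerical precision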

Missing intercepts of OLS Regression models in Python statsmodels

Question: I am running a rolling OLS regression estimation (with a window of, for example, 100) on the dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which is in the following format:

    time      X   Y
    0.000543  0   10
    0.000575  0   10
    0.041324  1   10
    0.041331  2   10
    0.041336  3   10
    0.04134   4   10
    ...
    9.987735  55  239
    9.987739  56  239
    9.987744  57  239
    9.987749  58  239
    9.987938  59  239

The third column (Y) in my dataset is my true value; that's what I want to predict (estimate). I want to do a…
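A sketch of one way to run the 100-observation rolling fit on data in this shape, assuming statsmodels >= 0.11 and the file loaded into a DataFrame df with columns time, X, Y:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.regression.rolling import RollingOLS  # statsmodels >= 0.11

    exog = sm.add_constant(df['X'])
    res = RollingOLS(df['Y'], exog, window=100).fit()

    # Estimate of Y at each point from the coefficients of the window ending
    # there; min_count keeps the first 99 rows as NaN instead of summing to 0
    df['Y_hat'] = (exog * res.params).sum(axis=1, min_count=2)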