regression

Python Pandas: how to turn a DataFrame with “factors” into a design matrix for linear regression?

Submitted by 风流意气都作罢 on 2019-11-30 09:12:40
If memory serves me, R has a data type called factor which, when used within a data frame, is automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into the indicator rows 1 0 0, 0 1 0, or 0 0 1 for the purpose of feeding lower-level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within pandas, but since I have my own customised regression routines I am really interested in the construction of the design matrix (a 2d …
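A minimal sketch of one common approach, using pandas.get_dummies to expand a categorical ("factor-like") column into indicator columns; the column and variable names here are illustrative, not from the question:

```python
import pandas as pd

# Hypothetical frame with a categorical ("factor-like") column.
df = pd.DataFrame({
    "y": [1.0, 2.5, 0.7, 3.1],
    "answer": pd.Categorical(["True", "False", "Maybe", "True"]),
})

# Expand the factor into 0/1 indicator columns, one per level.
X = pd.get_dummies(df[["answer"]], dtype=int)
print(X)  # columns: answer_False, answer_Maybe, answer_True

# patsy builds a full design matrix (intercept plus reference-coded
# dummies) directly from a formula, closer to R's model.matrix:
# from patsy import dmatrices
# y, X = dmatrices("y ~ answer", df, return_type="dataframe")
```

The resulting 2-D array can then be handed to custom regression code.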

Getting statsmodels to use heteroskedasticity corrected standard errors in coefficient t-tests

Submitted by 匆匆过客 on 2019-11-30 09:03:01
I've been digging into the API of statsmodels.regression.linear_model.RegressionResults and have found how to retrieve different flavors of heteroskedasticity-corrected standard errors (via properties like HC0_se, etc.). However, I can't quite figure out how to get the t-tests on the coefficients to use these corrected standard errors. Is there a way to do this in the API, or do I have to do it manually? If the latter, can you suggest any guidance on how to do this with statsmodels results? The fit methods of the linear models, discrete models, and GLM take cov_type and cov_kwds arguments …
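A minimal sketch of the cov_type route the answer points to (the data here are simulated for illustration; HC0 through HC3 are among the supported covariance types):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
# Heteroskedastic noise so the robust correction matters.
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=100) * (1 + np.abs(X[:, 1]))

# Passing cov_type to fit() makes the reported standard errors,
# t statistics, and p-values all use the robust covariance.
robust = sm.OLS(y, X).fit(cov_type="HC3")
print(robust.summary())

# An already-fitted result can also be re-wrapped:
# robust = sm.OLS(y, X).fit().get_robustcov_results(cov_type="HC1")
```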

Python Multiple Linear Regression using OLS code with specific data?

Submitted by 时光怂恿深爱的人放手 on 2019-11-30 07:30:49
I am using the ols.py code downloaded from the SciPy Cookbook (the download is in the first paragraph, with the bold OLS), but rather than feeding random data to the ols function I need to run a multiple linear regression on my own data. I have a specific dependent variable y and three explanatory variables. Every time I try to put my variables in place of the random ones, it gives me the error: TypeError: this constructor takes no arguments. Can anyone help? Is this possible to do? Here is a copy of the ols code I am trying to use along with the variables that I am trying to input from _…
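The Cookbook's ols.py predates modern statsmodels; rather than debugging the old class, one option is a sketch like the following, where x1, x2, x3, and y stand in for the asker's variables:

```python
import numpy as np
import statsmodels.api as sm

# Stand-ins for the asker's dependent and three explanatory variables.
y  = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.2])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.5, 3.5, 2.8, 4.1, 3.9])
x3 = np.array([0.5, 0.7, 0.2, 0.9, 1.1, 0.8])

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # intercept + 3 regressors
fit = sm.OLS(y, X).fit()
print(fit.params)    # coefficients
print(fit.rsquared)  # R^2
```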

Difference between Linear Regression Coefficients between Python and R

Submitted by 拟墨画扇 on 2019-11-30 07:04:42
I'm trying to run a linear regression in Python that I have already done in R, in order to find variables with 0 coefficients. The issue I'm running into is that the linear regression in R returns NAs for columns with low variance, while the scikit-learn regression returns coefficients for them. In the R code I find and save these variables by keeping the ones the linear regression reports as NA, but I can't seem to figure out a way to mimic this behavior in Python. The code I'm using can be found below.

R code:

a <- c(23, 45, 546, 42, 68, 15, 47)
b <- c(1, 2, 4, 6, 34, 2, 8)
c <- c(22 …
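R's lm() drops aliased (exactly collinear) columns and reports NA for them, whereas scikit-learn's least-squares solver returns some coefficient for every column. One way to mimic the R behavior is to check which columns add rank; a sketch under the assumption that the goal is flagging exactly collinear columns:

```python
import numpy as np

def aliased_columns(X):
    """Return indices of columns that are linear combinations
    of the columns before them (the ones lm() would mark NA)."""
    X = np.asarray(X, dtype=float)
    aliased, rank = [], 0
    for j in range(X.shape[1]):
        r = np.linalg.matrix_rank(X[:, : j + 1])
        if r == rank:          # column j added no new direction
            aliased.append(j)
        rank = r
    return aliased

a = np.array([23, 45, 546, 42, 68, 15, 47], dtype=float)
b = np.array([1, 2, 4, 6, 34, 2, 8], dtype=float)
X = np.column_stack([a, b, 2 * a])   # third column aliases the first
print(aliased_columns(X))            # -> [2]
```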

R nls singular gradient

Submitted by 淺唱寂寞╮ on 2019-11-30 07:04:02
Question: I've tried searching the other threads on this topic, but none of the fixes are working for me. I have the results of a natural experiment and I want to show that the number of consecutive occurrences of an event fits an exponential distribution. My R session is pasted below:

f <- function(x, a, b) { a * exp(b * x) }
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27
> y
 [1] 1880  813  376  161  100   61   31    9    8    2    7    4    3    2    0
[16]    1    0    0    0    0    0    1    0    0    0    0    1
> dat2
  x    y
1 1 1880
2 2  813 …
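The singular-gradient error from nls usually means the starting values are far from the optimum. A Python analogue of the standard fix, taking starting values from a log-linear fit on the positive counts and then running the nonlinear fit (scipy's curve_fit stands in for nls here), using the x and y from the question:

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(1, 28, dtype=float)
y = np.array([1880, 813, 376, 161, 100, 61, 31, 9, 8, 2, 7, 4, 3, 2, 0,
              1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1], dtype=float)

def f(x, a, b):
    return a * np.exp(b * x)

# Starting values from a log-linear fit on the positive counts;
# without a sensible start the optimizer (like nls) can fail.
pos = y > 0
slope, intercept = np.polyfit(x[pos], np.log(y[pos]), 1)
p0 = [np.exp(intercept), slope]

params, _ = curve_fit(f, x, y, p0=p0)
print(params)  # fitted a, b
```

The same idea in R would be start = list(a = exp(intercept), b = slope) from lm(log(y) ~ x) on the positive values.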

Fitting linear model / ANOVA by group [duplicate]

Submitted by 安稳与你 on 2019-11-30 06:04:14
Question: This question already has answers here: Linear Regression and group by in R (10 answers). Closed 3 years ago. I'm trying to run anova() in R and running into some difficulty. This is what I've done up to now to help shed some light on my question. Here is the str() of my data to this point:

str(mhw)
'data.frame': 500 obs. of 5 variables:
 $ r    : int 1 2 3 4 5 6 7 8 9 10 ...
 $ c    : int 1 1 1 1 1 1 1 1 1 1 ...
 $ grain: num 3.63 4.07 4.51 3.9 3.63 3.16 3.18 3.42 3.97 3.4 ...
 $ straw: num 6.37 6.24 …
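The duplicate target covers the R idioms (by(), lapply(split(...)), dplyr). For reference, a Python analogue of fitting one model per group; the small frame below only imitates the shape of mhw:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical frame shaped like mhw: straw regressed on grain,
# fit separately within each level of the grouping column c.
df = pd.DataFrame({
    "c":     [1, 1, 1, 1, 2, 2, 2, 2],
    "grain": [3.63, 4.07, 4.51, 3.90, 3.63, 3.16, 3.18, 3.42],
    "straw": [6.37, 6.24, 7.05, 6.91, 5.93, 5.59, 5.32, 5.52],
})

fits = {
    level: smf.ols("straw ~ grain", data=grp).fit()
    for level, grp in df.groupby("c")
}
for level, fit in fits.items():
    print(level, fit.params.to_dict())
```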

Multiple regression analysis in R using QR decomposition

Submitted by 試著忘記壹切 on 2019-11-30 05:50:29
I am trying to write a function that solves multiple regression using QR decomposition. Input: a y vector and an X matrix; output: b, e, R^2. So far I've got the following and am terribly stuck; I think I have made everything way too complicated:

QR.regression <- function(y, X) {
  X <- as.matrix(X)
  y <- as.vector(y)
  p <- as.integer(ncol(X))
  if (is.na(p)) stop("ncol(X) is invalid")
  n <- as.integer(nrow(X))
  if (is.na(n)) stop("nrow(X) is invalid")
  nr <- length(y)
  nc <- NCOL(X)
  # Householder
  for (j in seq_len(nc)) {
    id <- seq.int(j, nr)
    sigma <- sum(X[id, j]^2)
    s <- sqrt(sigma)
    diag_ej <- X[j, j]
    gamma <- 1.0 / …
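For comparison, the same computation sketched in Python with a library QR (numpy's qr does internally what the hand-rolled Householder loop above attempts); the test data at the bottom are made up:

```python
import numpy as np

def qr_regression(y, X):
    """OLS via QR: returns coefficients b, residuals e, and R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Q, R = np.linalg.qr(X)               # X = QR, R upper triangular
    b = np.linalg.solve(R, Q.T @ y)      # solve R b = Q'y
    e = y - X @ b                        # residuals
    ss_res = e @ e
    ss_tot = np.sum((y - y.mean()) ** 2)
    return b, e, 1.0 - ss_res / ss_tot   # R^2 assumes an intercept column

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([2.0, 1.0, -3.0]) + rng.normal(size=50)
b, e, r2 = qr_regression(y, X)
print(b, r2)
```

In R itself, qr.solve(X, y) or lm.fit() would play the same role.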

python stats models - quadratic term in regression

Submitted by 喜你入骨 on 2019-11-30 05:11:54
I have the following linear regression:

import statsmodels.formula.api as sm
model = sm.ols(formula = 'a ~ b + c', data = data).fit()

I want to add a quadratic term for b to this model. Is there a simple way to do this with statsmodels.ols? Is there a better package I should be using to achieve this? Although the solution by Alexander works, in some situations it is not very convenient. For example, each time you want to predict the outcome of the model for new values, you need to remember to pass both b**2 and b, which is cumbersome and should not be necessary. Although patsy does …
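The patsy route the answer alludes to keeps the transformation inside the formula, so predict() on new data needs only b and c. A minimal sketch with simulated data (the frame and coefficients are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
data = pd.DataFrame({"b": rng.uniform(0, 10, 100), "c": rng.normal(size=100)})
data["a"] = 1 + 2 * data["b"] - 0.3 * data["b"] ** 2 + data["c"] + rng.normal(size=100)

# I(b**2) tells patsy to compute the square inside the formula,
# so it is recomputed automatically at predict time.
model = smf.ols("a ~ b + I(b**2) + c", data=data).fit()

new = pd.DataFrame({"b": [2.0, 5.0], "c": [0.0, 0.0]})
print(model.predict(new))   # no need to pass b**2 explicitly
```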

Partial Least Squares Library

Submitted by 时间秒杀一切 on 2019-11-30 03:44:50
There was already a question like this, but it was not answered, so I am trying to post it again. Does anyone know of an open-source implementation of a partial least squares algorithm in C++ (or C)? Or maybe a library that does it? FastPLS is a library that provides C/C++ and MATLAB interfaces for fast partial least squares. Its author is Balaji Vasan Srinivasan, who worked under the supervision of Professor Ramani Duraiswami at the University of Maryland, College Park, MD, USA. Partial Least Squares and Generalized Partial Least Squares models based on the NIPALS algorithm. implement…
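To make the NIPALS reference concrete, here is a single-response (PLS1) sketch, written in Python purely for brevity; the loop translates line by line into C++:

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS1 via NIPALS for a single response y; returns regression weights."""
    X = X - X.mean(axis=0)              # center X and y
    y = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)          # weight vector
        t = X @ w                       # scores
        tt = t @ t
        p = X.T @ t / tt                # X loadings
        q = (y @ t) / tt                # y loading
        X = X - np.outer(t, p)          # deflate X and y
        y = y - q * t
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    # Coefficients mapping centered X to centered y: B = W (P'W)^-1 q
    return W @ np.linalg.solve(P.T @ W, Q)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + 0.1 * rng.normal(size=60)
print(pls1_nipals(X, y, 3))
```

If a Python dependency is acceptable, scikit-learn's PLSRegression implements the same idea.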

Regression (logistic) in R: Finding x value (predictor) for a particular y value (outcome)

Submitted by 僤鯓⒐⒋嵵緔 on 2019-11-30 03:32:37
Question: I've fitted a logistic regression model that predicts a binary outcome vs from mpg (mtcars dataset). The plot is shown below. How can I determine the mpg value for any particular predicted probability of vs? For example, I'm interested in finding out what the mpg value is when the probability of vs is 0.50. I appreciate any help anyone can provide!

model <- glm(vs ~ mpg, data = mtcars, family = binomial)
ggplot(mtcars, aes(mpg, vs)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = …
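For p = 0.50 the algebra is simple: logit(0.50) = 0, so the crossing point is -intercept/slope. A Python analogue of the inversion (statsmodels can fetch mtcars over the network via get_rdataset; in R the same arithmetic applies to coef(model)):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# mtcars fetched through statsmodels' R-dataset helper (needs internet).
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
fit = smf.glm("vs ~ mpg", data=mtcars, family=sm.families.Binomial()).fit()

b0, b1 = fit.params["Intercept"], fit.params["mpg"]
p = 0.50
mpg_at_p = (np.log(p / (1 - p)) - b0) / b1   # invert logit(p) = b0 + b1 * mpg
print(mpg_at_p)                              # for p = 0.5 this is just -b0/b1
```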