regression

Stepwise Regression in Python

蹲街弑〆低调 submitted on 2019-11-30 01:43:18
How can I perform stepwise regression in Python? There are methods for OLS in SciPy, but I am not able to do stepwise selection. Any help in this regard would be greatly appreciated. Thanks. Edit: I am trying to build a linear regression model. I have 5 independent variables, and using forward stepwise regression I aim to select variables such that my model has the lowest p-value. The following link explains the objective: https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-
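Neither SciPy nor scikit-learn ships a stepwise-regression routine, but forward selection on p-values can be sketched with plain NumPy/SciPy OLS. This is an illustrative sketch, not a library API: the helper names (`ols_pvalues`, `forward_stepwise`) and the significance threshold are my own choices.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS with an intercept; return two-sided p-values for the slope terms."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    dof = n - k - 1
    sigma2 = resid @ resid / dof                   # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
    t = beta / se
    return 2 * stats.t.sf(np.abs(t), dof)[1:]      # drop the intercept's p-value

def forward_stepwise(X, y, alpha=0.01):
    """Greedy forward selection: repeatedly add the candidate column whose
    coefficient has the smallest p-value, while that p-value is below alpha."""
    included = []
    while len(included) < X.shape[1]:
        candidates = [j for j in range(X.shape[1]) if j not in included]
        pvals = {j: ols_pvalues(X[:, included + [j]], y)[-1] for j in candidates}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        included.append(best)
    return included

# Quick check on synthetic data: y depends only on columns 1 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 1] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)
selected = forward_stepwise(X, y, alpha=0.001)
print(selected)
```

Note that selecting on p-values this way invalidates the usual inferential interpretation of those p-values; it is shown here only because it is what the question asks for.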

GridSearchCV - XGBoost - Early Stopping

白昼怎懂夜的黑 submitted on 2019-11-30 00:30:36
I am trying to do a hyperparameter search on XGBoost using scikit-learn's GridSearchCV. During the grid search I'd like it to stop early, since that reduces search time drastically and I expect better results on my prediction/regression task. I am using XGBoost via its scikit-learn API. model = xgb.XGBRegressor() GridSearchCV(model, paramGrid, verbose=verbose, fit_params={'early_stopping_rounds':42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX, trainY) I tried to pass the early-stopping parameters using fit_params, but then it throws
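The `fit_params` constructor argument was deprecated and later removed from scikit-learn; instead, extra keyword arguments passed to `GridSearchCV.fit` are forwarded to the underlying estimator's `fit`, which is the route an `eval_set` and `early_stopping_rounds` would take to reach `XGBRegressor` (in recent xgboost releases `early_stopping_rounds` moved to the constructor instead). Since xgboost may not be installed, this sketch demonstrates the forwarding mechanism with `sample_weight` on a plain scikit-learn estimator:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
# Keyword arguments given to .fit are forwarded to Ridge.fit --
# the same route early-stopping arguments take to XGBRegressor.fit.
grid.fit(X, y, sample_weight=np.ones(len(y)))
print(grid.best_params_)
```

With an older xgboost this would look like `grid.fit(trainX, trainY, eval_set=[(validX, validY)], early_stopping_rounds=42)`; with xgboost 2.x, put `early_stopping_rounds=42` in the `XGBRegressor(...)` constructor. Check your installed versions, as both APIs have changed over time.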

How does sklearn do linear regression when p > n?

久未见 submitted on 2019-11-29 23:23:04
Question: It is known that when the number of variables (p) is larger than the number of samples (n), the least-squares estimator is not defined. Yet in sklearn I receive these values: In [30]: lm = LinearRegression().fit(xx,y_train) In [31]: lm.coef_ Out[31]: array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124, 0.08619906, -0.08108713]]) In [32]: xx.shape Out[32]: (1097, 3419) I expected the call in [30] to return an error. How does sklearn work when p > n, as in this case? EDIT: It seems that the matrix is
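What happens under the hood (in current scikit-learn, on dense inputs) is that `LinearRegression` solves the problem with `scipy.linalg.lstsq`, which for an underdetermined system returns the minimum-norm least-squares solution via the SVD, rather than raising an error. A small sketch to verify this against the pseudoinverse:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # n = 5 samples, p = 20 features: p > n
y = rng.normal(size=5)

lm = LinearRegression().fit(X, y)

# Reproduce the coefficients with the pseudoinverse of the centered data
# (fit_intercept=True centers X and y internally before solving).
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef_pinv = np.linalg.pinv(Xc) @ yc

print(np.allclose(lm.coef_, coef_pinv))   # the minimum-norm solution
print(np.allclose(lm.predict(X), y))      # with p > n the fit interpolates
```

So the coefficients exist and are unique in the minimum-norm sense, but the model perfectly interpolates the training data and generalizes poorly; that is why p > n usually calls for regularization (Ridge, Lasso) instead.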

Simple multidimensional curve fitting

孤者浪人 submitted on 2019-11-29 23:08:41
I have a bunch of data, generally in the form a, b, c, ..., y where y = f(a, b, c, ...). Most of them have three or four variables and 10k–10M records. My general assumption is that they are algebraic in nature, something like: y = P1 a^E1 + P2 b^E2 + P3 c^E3 Unfortunately, my last statistical-analysis class was 20 years ago. What is the easiest way to get a good approximation of f? Open-source tools with a very minimal learning curve (i.e. something where I could get a decent approximation in an hour or so) would be ideal. Thanks! David Z: In case it's useful, here's a Numpy/Scipy (Python
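For exactly this functional form, `scipy.optimize.curve_fit` has about the smallest learning curve of any open-source option: define the model as a Python function and it fits the parameters by nonlinear least squares. A sketch on synthetic data (the true coefficients and the `p0` starting guess are illustrative, not from the question):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(X, P1, E1, P2, E2, P3, E3):
    """The assumed algebraic form: y = P1*a^E1 + P2*b^E2 + P3*c^E3."""
    a, b, c = X
    return P1 * a**E1 + P2 * b**E2 + P3 * c**E3

# Synthetic stand-in for the real records; the true parameters are made up.
rng = np.random.default_rng(1)
a, b, c = rng.uniform(0.5, 2.0, size=(3, 1000))
y = 2.0 * a**1.5 + 0.5 * b**2.0 + 3.0 * c**0.5

params, _ = curve_fit(f, (a, b, c), y, p0=np.ones(6))
print(params)
```

With 10M records you would typically fit on a random subsample first; nonlinear fits are also sensitive to the starting guess `p0`, so try a few if the residuals look bad.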

Why is it inadvisable to get statistical summary information for regression coefficients from a glmnet model?

橙三吉。 submitted on 2019-11-29 20:52:04
I have a regression model with a binary outcome. I fitted the model with glmnet and obtained the selected variables and their coefficients. Since glmnet doesn't calculate variable importance, I would like to feed that exact output (the selected variables and their coefficients) to glm to get the summary information (standard errors, etc.). I searched the R documentation, and it seems I can use the "method" option in glm to specify a user-defined function, but I have failed to do so. Could someone help me with this? "It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In

How to run an lm regression for every column in R

隐身守侯 submitted on 2019-11-29 18:07:30
I have a data frame: df=data.frame(x=rnorm(100),y1=rnorm(100),y2=rnorm(100),y3=...) I want to run a loop which regresses each column, starting from the second, on the first column: for(i in names(df[,-1])){ model = lm(i~x, data=df) } But this failed. The point is that I want to loop a regression over each column, and some of the column names are just numbers (e.g. 404.1), so I cannot find a way to run the loop with the command above. Your code looks fine, except that when you pass i to lm, R will read i as a string, which you can't regress things against. Using get will allow you to

Regression and summary statistics by group within a data.table

守給你的承諾、 submitted on 2019-11-29 17:50:42
Question: I would like to calculate some summary statistics and run different regressions by group within a data.table, and get the results in "wide" format (i.e. one row per group with several columns). I can do it in multiple steps, but it seems like it should be possible to do it all at once. Consider this example data: set.seed(46984) dt <- data.table(ID=c(rep('Frank',5),rep('Tony',5),rep('Ed',5)), y=rnorm(15), x=rnorm(15), z=rnorm(15), key="ID") dt # ID y x z # 1: Ed 0.2129400 -0.3024061 0

Can't get aggregate() to work for regression by group

北慕城南 submitted on 2019-11-29 17:41:54
I want to use aggregate with this custom function: # linear regression function CalculateLinRegrDiff = function(sample){ fit <- lm(value ~ date, data = sample) diff(range(fit$fitted)) } dataset2 = aggregate(value ~ id + col, dataset, CalculateLinRegrDiff(dataset)) I receive the error: Error in get(as.character(FUN), mode = "function", envir = envir) : object 'FUN' of mode 'function' was not found What is wrong? Your syntax for aggregate is wrong in the first place: pass the function CalculateLinRegrDiff itself, not an evaluated call CalculateLinRegrDiff(dataset), to the FUN argument. Secondly, you've chosen the

How to write multivariate logarithmic regression with Python and sklearn?

左心房为你撑大大i submitted on 2019-11-29 17:10:28
I wrote code for multivariate polynomial regression; I used PolynomialFeatures and a transformation function from sklearn. Is it possible to do multivariate logarithmic regression? Does sklearn have some kind of logarithmic transformation, the way it has polynomial features? How can I write multivariate logarithmic regression in Python? This is my code for multivariate polynomial features: import numpy as np import pandas as pd import math import xlrd from sklearn import linear_model from sklearn.model_selection import train_test_split from sklearn.preprocessing import PolynomialFeatures
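sklearn has no log-feature analogue of PolynomialFeatures, but `FunctionTransformer(np.log)` in a Pipeline gives the same effect: fit a linear model on log-transformed features, i.e. y ≈ b0 + b1·log(x1) + ... (the inputs must be strictly positive). A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 3))        # log needs positive inputs
y = 2.0 * np.log(X[:, 0]) - 1.5 * np.log(X[:, 2]) + 4.0

# Log-transform every feature, then fit an ordinary linear regression.
model = make_pipeline(FunctionTransformer(np.log), LinearRegression())
model.fit(X, y)
print(model[-1].coef_)   # close to [2.0, 0.0, -1.5]
```

If only some columns should be log-transformed, replace the `FunctionTransformer` step with a `ColumnTransformer` that applies it to the chosen columns.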

Prediction of 'mlm' linear model object from `lm()`

[亡魂溺海] submitted on 2019-11-29 16:15:11
I have three datasets: response - matrix of 5 (samples) x 10 (dependent variables); predictors - matrix of 5 (samples) x 2 (independent variables); test_set - matrix of 10 (samples) x 2 (independent variables as defined in predictors). response <- matrix(sample.int(15, size = 5*10, replace = TRUE), nrow = 5, ncol = 10) colnames(response) <- c("1_DV","2_DV","3_DV","4_DV","5_DV","6_DV","7_DV","8_DV","9_DV","10_DV") predictors <- matrix(sample.int(15, size = 5*2, replace = TRUE), nrow = 5, ncol = 2) colnames(predictors) <- c("1_IV","2_IV") test_set <- matrix(sample.int(15, size = 10*2, replace = TRUE), nrow =