glm

predict.glm() with three new categories in the test data (r)(error)

一曲冷凌霜 submitted on 2019-12-01 13:21:36
Question: I have a data set called data with 481,092 rows. I split data into two equal halves: the first half (rows 1 to 240,546) is called train and was used for the glm(); the second half (rows 240,547 to 481,092) is called test and should be used to validate the model. Then I started the regression:

    testreg <- glm(train$returnShipment ~ train$size + train$color + train$price + train$manufacturerID + train$salutation + train$state + train$age + train$deliverytime, family=binomial(link="logit"),
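A minimal sketch of two things that commonly go wrong in a setup like this, offered under assumptions because the excerpt above is cut off. First, a formula written with `train$...` terms ties the model to the train vectors, so `predict(..., newdata = test)` cannot substitute the test columns; passing `data = train` with bare column names avoids that. Second, factor levels that appear only in test make `predict()` stop with a "new levels" error, so the sketch flags those rows and leaves their predictions as NA. The toy data frames, the reduced set of predictors, and the choice to return NA are illustrative only.

```r
# Toy stand-in data: column names are borrowed from the question, values are invented.
set.seed(1)
train <- data.frame(
  returnShipment = rbinom(200, 1, 0.4),
  price = runif(200, 10, 100),
  color = factor(sample(c("red", "blue"), 200, replace = TRUE))
)
test <- data.frame(
  price = runif(50, 10, 100),
  color = factor(sample(c("red", "blue", "green"), 50, replace = TRUE))
)

# Bare column names plus data = train, so predict() can look the same names up in newdata.
fit <- glm(returnShipment ~ price + color, data = train,
           family = binomial(link = "logit"))

# Levels present in test but never seen during fitting.
unseen <- setdiff(unique(as.character(test$color)), levels(train$color))

# Predict only for rows whose levels the model knows; others stay NA.
ok <- !(test$color %in% unseen)
pred <- rep(NA_real_, nrow(test))
pred[ok] <- predict(fit, newdata = droplevels(test[ok, ]), type = "response")
```

Whether rows with unseen levels should be dropped, merged into an "other" category, or handled some other way is a modelling decision that this sketch does not settle.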

h2o.glm lambda search not appearing to iterate over all lambdas

我的梦境 submitted on 2019-12-01 06:22:57
Question: Please consider the following basic reproducible example:

    library(h2o)
    h2o.init()
    data("iris")
    iris.hex = as.h2o(iris, "iris.hex")
    mod = h2o.glm(y = "Sepal.Length", x = setdiff(colnames(iris), "Sepal.Length"),
                  training_frame = iris.hex, nfolds = 2, seed = 100,
                  lambda_search = T, early_stopping = F, family = "gamma", nlambdas = 100)

When I run the above, I expect that h2o will iterate over 100 different values of lambda. However, running length(mod@allparameters$lambda) shows that only 79 values of lambda were actually tested. These 79 values are the first 79 values in the sequence:

Regression for a Rate variable in R

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-01 03:47:07
Question: I was tasked with developing a regression model looking at student enrollment in different programs. This is a very nice, clean data set where the enrollment counts follow a Poisson distribution well. I fit a model in R (using both GLM and Zero Inflated Poisson). The resulting residuals seemed reasonable. However, I was then instructed to change the count of students to a "rate", which was calculated as students / school_population (each school has its own population). This is now no longer a
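One common way to model a rate such as students / school_population while keeping the Poisson count model is to leave the count as the response and add the log of the population as an offset. The sketch below illustrates that idea on simulated data; it is not necessarily what the original answers proposed. The `program` factor and all simulated values are assumptions, and only the names `students` and `school_population` come from the question.

```r
# Simulated data: two hypothetical programs with different true enrollment rates.
set.seed(42)
schools <- data.frame(
  program = factor(rep(c("A", "B"), each = 50)),
  school_population = sample(200:2000, 100, replace = TRUE)
)
true_rate <- ifelse(schools$program == "A", 0.02, 0.05)
schools$students <- rpois(100, lambda = true_rate * schools$school_population)

# Count response with offset(log(population)): coefficients act on the log rate.
fit <- glm(students ~ program + offset(log(school_population)),
           family = poisson(link = "log"), data = schools)

# Estimated enrollment rate per program, back-transformed from the log scale.
rate_A <- exp(coef(fit)[["(Intercept)"]])
rate_B <- exp(coef(fit)[["(Intercept)"]] + coef(fit)[["programB"]])
```

With only a program factor in the model, each estimated rate equals total students divided by total population within that program, which makes the role of the offset easy to check by hand.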

Saving a single object within a function in R: RData file size is very large

一世执手 submitted on 2019-11-29 23:55:45
Question: I am trying to save trimmed-down GLM objects in R (i.e. with all the "non-essential" characteristics set to NULL, e.g. residuals, prior.weights, qr$qr). As an example, looking at the smallest object that I need to do this with:

    print(object.size(glmObject))
    # 168992 bytes
    save(glmObject, file = "FileName.RData")

Assigning this object in the global environment and saving leads to an RData file of about 6KB. However, I effectively need to create and save the glm object within a function, which is in
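A frequent cause of this symptom, stated here as an assumption because the question is truncated, is that a formula created inside a function keeps a reference to that function's environment. Serializing the fitted object then drags along every local variable that happened to exist there. The sketch below reproduces the effect with a hypothetical large local vector `big` and measures the serialized size before and after pointing the stored environments at the global environment.

```r
# Fit a glm inside a function that also holds a large, unrelated local object.
fit_in_function <- function() {
  big <- rnorm(1e6)  # hypothetical large local; never used by the model
  d <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))
  glm(y ~ x, data = d, family = binomial)
}

set.seed(3)
fit <- fit_in_function()

# Serialized size is what save() or saveRDS() would have to write.
size_before <- length(serialize(fit, NULL))

# Re-point the environments carried by the formula and the terms objects.
environment(fit$formula) <- globalenv()
environment(fit$terms) <- globalenv()
environment(attr(fit$model, "terms")) <- globalenv()

size_after <- length(serialize(fit, NULL))
```

Replacing the environment can affect later calls that look up objects through the formula, so it is worth checking that `predict()` still behaves as expected on the trimmed object.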

Why is caret train taking up so much memory?

一笑奈何 submitted on 2019-11-29 22:55:09
When I train using just glm, everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory. Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this. Any hints? I only care about the performance summary and maybe the predicted responses. (I know it's not related to storing data from each iteration of the parameter-tuning grid search, because I believe there is no grid for glm.) The

Why is it inadvisable to get statistical summary information for regression coefficients from glmnet model?

橙三吉。 submitted on 2019-11-29 20:52:04
Question: I have a regression model with a binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients. Since glmnet doesn't calculate variable importance, I would like to feed the exact output (selected variables and their coefficients) to glm to get the information (standard errors, etc.). I searched the R documentation, and it seems I can use the "method" option in glm to specify a user-defined function. But I failed to do so; could someone help me with this?

Answer 1: "It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In

How to update `lm` or `glm` model on same subset of data?

北城以北 submitted on 2019-11-29 16:00:45
I am trying to fit two nested models and then test them against each other using the anova function. The commands used are:

    probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male, data=dt, family=binomial(link = "probit"))
    nprobit <- update(probit, . ~ . - afqt1)
    anova(nprobit, probit, test="Rao")

However, the variable afqt1 apparently contains NAs, and because the update call does not use the same subset of data, anova() returns an error:

    Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, : models were not all fitted to the same size of dataset

Is there a simple way
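One way to keep both fits on the same rows, sketched here under assumptions, is to restrict the data to complete cases of every variable in the larger model before fitting, so that `update()` re-uses that same filtered data frame. The toy `dt` below has fewer predictors than the question's model and invented values; only the variable names are borrowed.

```r
# Toy data with missing values in afqt1 only.
set.seed(7)
n <- 300
dt <- data.frame(afqt1 = rnorm(n), fhgc = rnorm(n), male = rbinom(n, 1, 0.5))
dt$grad <- rbinom(n, 1, pnorm(0.5 * dt$afqt1 + 0.2 * dt$fhgc))
dt$afqt1[sample(n, 30)] <- NA

# Keep only rows that are complete for all variables of the full model.
vars <- all.vars(grad ~ afqt1 + fhgc + male)
dt_cc <- dt[complete.cases(dt[, vars]), ]

# Both models now see exactly the same observations.
probit <- glm(grad ~ afqt1 + fhgc + male, data = dt_cc,
              family = binomial(link = "probit"))
nprobit <- update(probit, . ~ . - afqt1)
cmp <- anova(nprobit, probit, test = "Rao")
```

Filtering up front also makes the sample used for the comparison explicit, instead of relying on each fit's own `na.action`.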

model.matrix(): why do I lose control of contrast in this case

谁说胖子不能爱 submitted on 2019-11-29 12:46:31
Suppose we have a toy data frame:

    x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                    x2 = gl(3, 2, labels = LETTERS[1:3]))

I would like to construct a model matrix

    #   x1b x1c x2B x2C
    # 1   0   0   0   0
    # 2   0   0   0   0
    # 3   1   0   1   0
    # 4   1   0   1   0
    # 5   0   1   0   1
    # 6   0   1   0   1

by:

    model.matrix(~ x1 + x2 - 1, data = x,
                 contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
                                      x2 = contr.treatment(LETTERS[1:3])))

but actually I get:

    #   x1a x1b x1c x2B x2C
    # 1   1   0   0   0   0
    # 2   1   0   0   0   0
    # 3   0   1   0   1   0
    # 4   0   1   0   1   0
    # 5   0   0   1   0   1
    # 6   0   0   1   0   1
    # attr(,"assign")
    # [1] 1 1 1 2 2
    # attr(,"contrasts")
    # attr(,"contrasts")$x1
    #
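The behaviour shown above is consistent with how `model.matrix()` treats a formula without an intercept: the first factor is expanded into a full set of indicator columns, and contrasts only take effect for later factors. A minimal sketch of one workaround, assuming the default treatment contrasts are in effect, is to keep the intercept in the formula and drop its column afterwards.

```r
x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

# Without an intercept, x1 gets one indicator column per level.
mm_noint <- model.matrix(~ x1 + x2 - 1, data = x)

# With an intercept, both factors use treatment contrasts;
# removing the intercept column afterwards leaves the desired matrix.
mm <- model.matrix(~ x1 + x2, data = x)
mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
```

Here `mm` has the four columns x1b, x1c, x2B and x2C, while `mm_noint` additionally contains x1a.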
