Question:
When I try to define my linear model in R as follows:
lm1 <- lm(predictorvariable ~ x1+x2+x3, data=dataframe.df)
I get the following error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
Is there any way to ignore this or fix it? Some of the variables are factors and some are not.
Answer 1:
This error occurs when one of your independent (right-hand-side) variables is a factor or character vector that takes only one value.
Example: iris data in R
(model1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris))
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Coefficients:
#       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica
#            2.2514             0.8036             1.4587             1.9468
Now, if your data consists of only one species:
(model1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris[iris$Species == "setosa", ]))
# Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#   contrasts can be applied only to factors with 2 or more levels
If the variable is numeric (Sepal.Width) but takes only a single value, say 3, the model runs, but the coefficient of that variable is NA:
(model2 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris[iris$Sepal.Width == 3, ]))
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species,
#     data = iris[iris$Sepal.Width == 3, ])
# Coefficients:
#       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica
#             4.700                 NA              1.250              2.017
Solution: An independent variable that takes only one value has no variation, so its effect cannot be estimated. You need to drop that variable, whether it is numeric, character, or factor.
Updated as per comments: Since the error itself only occurs with factor/character variables, you can focus on those and check whether the number of levels of each is 1 (DROP) or greater than 1 (NODROP).
To see which variables are factors, use the following code:
(l <- sapply(iris, function(x) is.factor(x)))
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#        FALSE        FALSE        FALSE        FALSE         TRUE
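Then you can get a data frame of the factor variables only (drop = FALSE keeps it a data frame even when there is a single factor column, as in iris):
m <- iris[, l, drop = FALSE]
Now find the number of levels of each factor variable; if it is one, you need to drop that variable. droplevels() is applied first so that levels left empty by subsetting are not counted:
ifelse(n <- sapply(m, function(x) nlevels(droplevels(x))) == 1, "DROP", "NODROP")
Note: A factor variable with only one level is the one you have to drop.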
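The check above can be wrapped into a reusable helper. The following is a minimal sketch, not part of the original answer: the name `single_level_vars` is hypothetical, and it counts the values actually present rather than the declared levels, because a subset such as `iris[iris$Species == "setosa", ]` still carries all three declared levels of Species.

```r
# Hypothetical helper: names of factor/character columns with fewer than
# two distinct non-missing values, i.e. columns lm() cannot build contrasts for.
single_level_vars <- function(df) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  n_used <- vapply(df[is_cat],
                   function(x) length(unique(x[!is.na(x)])),
                   integer(1))
  names(df)[is_cat][n_used < 2]
}

setosa_only <- iris[iris$Species == "setosa", ]
nlevels(setosa_only$Species)    # declared levels survive subsetting: 3
single_level_vars(setosa_only)  # "Species"
single_level_vars(iris)         # character(0): nothing to drop
```

The returned names can then be removed from the model formula before calling lm().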
Answer 2:
It appears that at least one of your predictors, x1, x2, or x3, has only one factor level and hence is a constant.
Have a look at
lapply(dataframe.df[c("x1", "x2", "x3")], unique)
to find the different values.
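When the columns have many distinct values, printing them all is unwieldy, and counting them may be easier to read. A small sketch of that idea follows; the contents of `dataframe.df` here are made-up toy values standing in for the asker's data, and `n_distinct` is a hypothetical name.

```r
# Toy stand-in for dataframe.df (hypothetical values, for illustration only).
dataframe.df <- data.frame(
  y  = c(1.2, 2.3, 3.1, 4.8),
  x1 = c(0.5, 1.5, 2.5, 3.5),
  x2 = factor(c("a", "a", "a", "a")),  # constant factor: triggers the error
  x3 = factor(c("u", "v", "u", "v"))
)

# Number of distinct values per predictor; a count of 1 marks a constant column.
n_distinct <- sapply(dataframe.df[c("x1", "x2", "x3")],
                     function(x) length(unique(x)))
n_distinct
names(n_distinct)[n_distinct == 1]  # "x2"
```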
Answer 3:
gives the contrast error, while Levels <- c("Pri", "For") doesn't
This is probably a bug.
Answer 4:
This is a variation on the answer provided by @Metrics and edited by @Max Ghenis...
l <- sapply(iris, function(x) is.factor(x))
m <- iris[, l, drop = FALSE]  # drop = FALSE keeps a data frame even with one factor column
n <- sapply(m, function(x) {
  y <- summary(x) / length(x)              # share of records in each level
  len <- length(y[y < 0.005 | y > 0.995])  # number of levels beyond the thresholds
  cbind(len, t(y))
})
drop_cols_df <- data.frame(
  var    = names(l[l]),
  status = ifelse(as.vector(t(n[1, ])) == 0, "NODROP", "DROP"),
  level1 = as.vector(t(n[2, ])),
  level2 = as.vector(t(n[3, ]))
)
Here, after identifying the factor variables, the second sapply computes the share of records that belong to each level (category) of each variable. It then counts the levels whose incidence rate is above 99.5% or below 0.5% (my arbitrary thresholds).
For each categorical variable it returns that count together with the incidence rate of each level.
Variables with no levels crossing the thresholds should not be dropped, while the others should be dropped from the linear model.
The final data frame makes the results easy to view. It is hard-coded for my data set, in which every factor variable has two levels, but it can be made generic easily enough.
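One possible generic version is sketched below. It is an illustration rather than the answerer's code: the function name `level_shares` and its `threshold` argument are hypothetical, and it assumes the same symmetric 0.5% / 99.5% cut-offs. It returns one row per level, so factors with any number of levels are handled.

```r
# Hypothetical helper: one row per (factor variable, level) with that level's
# share of records, plus a per-variable DROP/NODROP status.
level_shares <- function(df, threshold = 0.005) {
  is_fac <- vapply(df, is.factor, logical(1))
  rows <- lapply(names(df)[is_fac], function(v) {
    share <- as.vector(table(df[[v]])) / nrow(df)  # NA values are not counted
    extreme <- any(share < threshold | share > 1 - threshold)
    data.frame(var = v,
               level = levels(df[[v]]),
               share = share,
               status = if (extreme) "DROP" else "NODROP")
  })
  do.call(rbind, rows)
}

level_shares(iris)                                   # Species: NODROP
level_shares(iris[iris$Species == "setosa", ])       # Species: DROP
```

A long format (one row per level) avoids hard-coding level1/level2 columns, at the cost of repeating the status on every row of a variable.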
Answer 5:
This error message may also appear when the data contains NAs.
In this case, the behaviour depends on the na.action setting (see the documentation of lm). With the default, na.omit, every case with an NA in any variable used in the formula is silently dropped. So a factor may well have several levels in the full data, yet only one level among the cases without NAs.
To fix the error in this case, either change the model (remove the problematic factor from the formula) or change the data (i.e. complete the cases).
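A small sketch of this situation follows, using made-up toy data (the data frame `d` and its columns are hypothetical) and assuming the default na.action of na.omit. The factor g has two levels overall, but every row of the second level has a missing x.

```r
# Toy data (hypothetical): g has two levels, but every "b" row has a missing x.
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  x = c(0.1, 0.4, 0.2, NA, NA, NA),
  g = factor(c("a", "a", "a", "b", "b", "b"))
)

# Rows that lm() would actually use once cases with NAs are omitted.
used <- d[complete.cases(d[c("y", "x", "g")]), ]
table(used$g)  # level "b" has no complete cases left

# Fitting on the full data fails; capture the error message instead of stopping.
fit <- tryCatch(lm(y ~ x + g, data = d),
                error = function(e) conditionMessage(e))
fit
```

Tabulating each factor on the complete cases, as done with `table(used$g)`, shows which factor has collapsed to a single level.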