Linear model (lm) when dependent variable is a factor/categorical variable?

Question

I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:

1: 0 days in arrears, 2: 30-60 days in arrears, 3: 60-90 days in arrears, and 4: 90+ days in arrears (four levels).

As independent variables I have several numeric variables: loan to value, debt to income, and interest rate.

Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummy variables, but those examples were all for the independent variables.

This did not work:

fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)

Answer 1:

Linear regression does not take a categorical variable as the dependent variable; the response has to be continuous. Since your AccountStatus variable has only four levels, it is not feasible to treat it as continuous. Before starting any statistical analysis, one should be aware of the measurement levels of one's variables.

What you can do is use multinomial logistic regression (see here, for instance). Alternatively, you can recode AccountStatus as dichotomous and use simple logistic regression.
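As a rough sketch of both options, the example below fits a multinomial model with multinom() from the nnet package and a simple logistic model with glm(). The data are simulated purely for illustration. Only AccountStatus and OriginalLoanToValue come from the question; the column names DebtToIncome, InterestRate, and InArrears are assumptions.

```r
library(nnet)  # recommended package shipped with R; provides multinom()

set.seed(1)
n <- 400

# Simulated stand-in for the asker's data; all values are made up
mydata <- data.frame(
  OriginalLoanToValue = runif(n, 40, 110),
  DebtToIncome        = runif(n, 5, 60),  # hypothetical column name
  InterestRate        = runif(n, 2, 9)    # hypothetical column name
)

# Build a four-level status that loosely depends on loan to value
latent <- 0.04 * mydata$OriginalLoanToValue + rnorm(n)
mydata$AccountStatus <- cut(latent, breaks = c(-Inf, 2.5, 3.2, 4, Inf),
                            labels = 1:4)

# Option 1: multinomial logistic regression (level "1" is the reference)
fit_multi <- multinom(
  AccountStatus ~ OriginalLoanToValue + DebtToIncome + InterestRate,
  data = mydata, trace = FALSE
)
summary(fit_multi)

# Option 2: collapse to a dichotomy (not in arrears vs. in arrears)
# and fit a simple logistic regression
mydata$InArrears <- as.integer(mydata$AccountStatus != "1")
fit_bin <- glm(
  InArrears ~ OriginalLoanToValue + DebtToIncome + InterestRate,
  data = mydata, family = binomial
)
summary(fit_bin)
```

The multinomial fit returns one row of coefficients per non-reference level, whereas the dichotomous fit returns a single set. Collapsing the levels is simpler to interpret but discards the distinction between the three arrears categories.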

Sorry to disappoint you, but this is an inherent restriction of multiple regression and has nothing to do with R. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly recommend this book.

Answer 2:

If you can assign a numeric value to each level of the variable, you may have a solution. Recode the levels as numbers, then convert the variable to a numeric one. Here is how:

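library(plyr)  # provides revalue()

# Relabel each factor level with the numeric score it should receive
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
               c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))

# Convert via character so the labels, not the internal level codes, become the numbers
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))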

This recodes the factor levels and converts the result to a numeric variable. The results I get are consistent with the original values in the dataset from when the variables were stored as factors. You can use the same approach to relabel the levels however you like while converting them to numeric variables.

Finally, this is worth doing because it lets you draw histograms of the variable or use it as a numeric response in a regression, neither of which works while it is stored as a factor.
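The same recoding can be sketched in base R without plyr, using a named lookup vector. The factor x below is a made-up example that reuses the labels from the code above. It also shows why the conversion has to work from the labels: calling as.integer() directly on a factor returns the internal level codes, not the scores.

```r
# Hypothetical survey factor reusing the labels from the answer
x <- factor(c("(1) Very Suitable", "(3) Somewhat Suitable",
              "(4) Not Suitable At All", "(2) Suitable",
              "(1) Very Suitable"))

# Named lookup vector: factor label -> numeric score
scores <- c("(1) Very Suitable"       = 3,
            "(2) Suitable"            = 2,
            "(3) Somewhat Suitable"   = 1,
            "(4) Not Suitable At All" = -1)

# Index the lookup by the labels (as character), then drop the names
x_score <- unname(scores[as.character(x)])
x_score
# 3  1 -1  2  3

# Pitfall: this gives the internal level codes, not the scores
as.integer(x)
# 1 3 4 2 1
```

Note that scoring the levels 3, 2, 1, -1 is itself a modelling choice: it assumes the gaps between categories have those sizes.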

Hope this helps!

Answer 3:

Expanding a little on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, the measurement level is ordinal (i.e. ordered, but with the distance between the levels unknown or undefined), so you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has a different and more complex underlying theory than simple linear regression, you should probably read more about it (e.g. at the Wikipedia article linked above, in the vignettes of the ordinal package, at the UCLA stats consulting page on ordinal regression, or by browsing related questions on CrossValidated).
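A minimal sketch of the polr() route, assuming simulated data in place of the asker's. The column names DebtToIncome and InterestRate, the level labels, and all numbers are illustrative assumptions, not taken from the question.

```r
library(MASS)  # recommended package shipped with R; provides polr()

set.seed(42)
n <- 500

# Simulated stand-in for the asker's data; all values are made up
mydata <- data.frame(
  OriginalLoanToValue = runif(n, 40, 110),
  DebtToIncome        = runif(n, 5, 60),  # hypothetical column name
  InterestRate        = runif(n, 2, 9)    # hypothetical column name
)

# Latent score with logistic noise, cut into four ordered categories
latent <- 0.03 * mydata$OriginalLoanToValue +
  0.02 * mydata$DebtToIncome + rlogis(n)
mydata$AccountStatus <- cut(
  latent, breaks = c(-Inf, 2, 3, 4, Inf),
  labels = c("0 days", "30-60 days", "60-90 days", "90+ days"),
  ordered_result = TRUE  # polr() expects an ordered factor response
)

# Proportional-odds logistic regression; Hess = TRUE stores the Hessian
# so that summary() can report standard errors
fit_ord <- polr(
  AccountStatus ~ OriginalLoanToValue + DebtToIncome + InterestRate,
  data = mydata, Hess = TRUE
)
summary(fit_ord)

# Predicted probability of each arrears category for the first rows
probs <- predict(fit_ord, type = "probs")
head(probs)
```

Unlike the multinomial model, this fit estimates a single slope per predictor plus one cutpoint between each pair of adjacent categories, which is how it uses the ordering of the levels.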

Source: https://stackoverflow.com/questions/22192934/linear-model-lm-when-dependent-variable-is-a-factor-categorical-variable

Tags

categorical-data

r-factor