Question
Why do we use one fewer dummy variable than the total number of dummy variables in a multiple linear regression model?
For example, if the model contains 4 dummy variables, we update our feature matrix before training the regression model: `x = x[:, 1:4]`, keeping only 3 of them.
Answer 1:
Because of the Dummy Variable Trap.
When including dummy variables in a regression model, one should be careful of the Dummy Variable Trap: a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated. In simple terms, one variable can be predicted from the others.
Let's say you have a simple categorical variable like gender, with the categories «male» and «female». You get two dummy variables, «male» and «female», each of which can be either true or false. This is redundant, because each one can be predicted perfectly from the other.
Another example: when you have four categories A/B/C/D, you get four dummy variables. If you know that the class is not A, B, or C, you know it must be D. Therefore you can, and should, drop one dummy variable.
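As a small sketch of the two examples above (using pandas' `get_dummies`, one common way to create dummy variables; the column name `cls` is made up for illustration):

```python
import pandas as pd

# A categorical column with the four classes A/B/C/D from the example above
df = pd.DataFrame({"cls": ["A", "B", "C", "D", "A"]})

# Full one-hot encoding: 4 dummy columns, one per category
full = pd.get_dummies(df["cls"])

# Every row has exactly one 1, so any single column is fully
# determined by the other three -- the redundancy described above
assert (full.sum(axis=1) == 1).all()

# drop_first=True drops the first category (A), keeping 3 columns
reduced = pd.get_dummies(df["cls"], drop_first=True)

print(full.shape)     # (5, 4)
print(reduced.shape)  # (5, 3)
```

A row of all zeros in the reduced encoding then simply means "class A", so no information is lost.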
Technically, this multicollinearity causes problems for your regression algorithm: the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data, and with perfect multicollinearity the least-squares problem has no unique solution at all.
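To see why the fit breaks down, here is a minimal NumPy sketch (synthetic data, made up for illustration): with an intercept column plus all four dummies, the intercept equals the sum of the dummy columns, so the design matrix loses a rank.

```python
import numpy as np

n = 100
cls = np.arange(n) % 4        # four classes 0..3, all present
dummies = np.eye(4)[cls]      # full one-hot encoding, 4 columns

# Design matrix with an intercept plus ALL four dummies:
# the intercept column equals the sum of the dummy columns,
# so the matrix is rank deficient (the Dummy Variable Trap)
X_trap = np.column_stack([np.ones(n), dummies])
print(np.linalg.matrix_rank(X_trap))  # 4, not 5

# Dropping one dummy column restores full column rank,
# and the least-squares solution becomes unique again
X_ok = np.column_stack([np.ones(n), dummies[:, 1:]])
print(np.linalg.matrix_rank(X_ok))    # 4 = number of columns
```

With `X_trap`, any constant can be shifted between the intercept and the dummy coefficients without changing the fitted values, which is exactly why the estimates are unstable.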
Bottom line: when modelling a categorical variable with N possible values, you should use N−1 dummy variables.
Source: https://stackoverflow.com/questions/51914169/dummy-variable-in-multiple-linear-regression