Dummy Variable in Multiple Linear Regression

Submitted by 为君一笑 on 2020-01-24 20:38:10

Question


Why do we use one fewer dummy variable than the number of categories in a multiple linear regression model?

For example, if a categorical variable produces 4 dummy variables, we update the feature matrix before training the regression model so that only three of them remain: x = x[:, 1:4].
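The slicing in the question can be sketched as follows. This is a minimal illustration, assuming x is a NumPy array whose four columns are the one-hot dummy columns for hypothetical categories A through D:

```python
import numpy as np

# Hypothetical design matrix: 4 one-hot (dummy) columns for categories A-D.
x = np.array([
    [1, 0, 0, 0],  # A
    [0, 1, 0, 0],  # B
    [0, 0, 1, 0],  # C
    [0, 0, 0, 1],  # D
])

# Keep columns 1..3, dropping the first dummy column. The remaining
# 3 columns still identify every category: a row of all zeros means
# the dropped category, A.
x = x[:, 1:4]
print(x.shape)  # (4, 3)
```

No information is lost: the dropped category is encoded implicitly as the all-zeros row.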


Answer 1:


Because of the Dummy Variable Trap.

When including dummy variables in a regression model, one should be careful of the Dummy Variable Trap: a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated. In simple terms, one variable can be predicted from the others.

Let's say you have a simple categorical variable like gender, with the categories "male" and "female". You get two dummy variables, "male" and "female", each of which can be true or false. One of them is redundant, because you can always predict it from the other.

Another example: when a categorical variable has four levels A/B/C/D, you get four dummy variables. If you know the class is not A, B, or C, it must be D. Therefore you can, and should, drop one dummy variable.
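In practice this drop is usually done at encoding time rather than by slicing afterwards. A minimal sketch with pandas, using an illustrative column name "cls" for the A/B/C/D example above:

```python
import pandas as pd

# Toy categorical column with four levels; names are illustrative.
df = pd.DataFrame({"cls": ["A", "B", "C", "D", "A"]})

# drop_first=True keeps N-1 = 3 dummy columns; the first level "A"
# becomes the baseline, encoded implicitly as all zeros.
dummies = pd.get_dummies(df["cls"], drop_first=True)
print(list(dummies.columns))  # ['B', 'C', 'D']
```

scikit-learn's OneHotEncoder offers the same behaviour via its drop='first' option.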

Technically, the dummy variable trap makes the independent variables multicollinear, which causes problems for the regression algorithm: the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data.
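The underlying problem can be made concrete with a rank check. This sketch (my own illustration, not from the original answer) builds a design matrix with an intercept column plus all four dummy columns and shows it is rank-deficient, while dropping one dummy restores full column rank:

```python
import numpy as np

# Design matrix with an intercept column plus ALL 4 dummy columns.
intercept = np.ones((4, 1))
dummies = np.eye(4)                       # one-hot rows for A, B, C, D
X_trap = np.hstack([intercept, dummies])  # 5 columns

# The dummy columns sum to the intercept column, so the columns are
# linearly dependent: rank 4 instead of 5 -> perfect multicollinearity,
# and the least-squares solution is not unique.
print(np.linalg.matrix_rank(X_trap))  # 4

# Dropping one dummy leaves 4 columns of rank 4: full column rank.
X_ok = np.hstack([intercept, dummies[:, 1:]])
print(np.linalg.matrix_rank(X_ok))  # 4
```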

Baseline: when modelling a categorical variable with N possible values, use N−1 dummy variables; the omitted category becomes the baseline.



Source: https://stackoverflow.com/questions/51914169/dummy-variable-in-multiple-linear-regression
