Different results with formula and non-formula for caret training

Submitted by 别来无恙 on 2019-12-03 08:36:40

You have a categorical predictor with a moderate number of levels. When you use the formula interface, most modeling functions (including train, lm, glm, etc.) internally run model.matrix to process the data set. This creates dummy variables from any factor variables. The non-formula interface does not [1].
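A minimal R sketch of that difference, with a made-up data frame for illustration (model.matrix is what the formula interface invokes internally):

```r
set.seed(1)
df <- data.frame(
  y = rnorm(6),
  color = factor(c("red", "blue", "green", "red", "green", "blue"))
)

# Formula interface: the factor is expanded into 0/1 dummy columns
# (one level is dropped as the reference level).
model.matrix(y ~ ., data = df)

# Non-formula interface: the factor column is passed through as a
# single predictor with all of its levels intact.
str(df["color"])
```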

When you use dummy variables, only one factor level can be used in any split. Tree methods handle categorical predictors differently but, when dummy variables are not used, random forest will order the factor levels based on the outcome and search for a 2-way split of the factor levels [2]. This takes more time.
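The two calling styles can be sketched with caret as follows (a hedged illustration, not a benchmark; df is a hypothetical data frame with outcome y and one or more factor predictors):

```r
library(caret)

# Formula interface: factors become dummy variables via model.matrix,
# so each tree split can only separate one level from the rest.
fit_formula <- train(y ~ ., data = df, method = "rf")

# Non-formula interface: factors stay intact, so the underlying
# randomForest code can place several levels on one side of a split,
# which is why the results (and timings) can differ.
x <- df[, setdiff(names(df), "y")]
fit_xy <- train(x = x, y = df$y, method = "rf")
```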

Max

[1] I hate to be one of those people who says "in my book I show..." but in this case I will. Fig. 14.2 has a good illustration of this process for CART trees.

[2] God, I'm doing it again. The different representations of factors for trees are discussed in section 14.1, and a comparison between the two approaches for one data set is shown in section 14.7.
