R caret / rfe variable selection for factors() AND NAs

北战南征 submitted on 2019-12-01 00:16:37

Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.

  • For NAs, either omit or impute (median, knn, etc.).
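  • For factor features, you were on the right track with model.matrix(). It generates a set of 0/1 "dummy" features, one per level of the factor. The "- 1" in the formula drops the intercept, so every level gets its own column instead of the first level being absorbed as the baseline. Typical usage looks like this: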
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
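
Putting the two bullets together, here is a minimal base R sketch of the "clean up first" approach. The data frame and its column names (age, group) are made up for illustration, and median imputation is just one of the options mentioned above, not a recommendation for every data set.

```r
# Toy data (hypothetical): one numeric column with an NA, one factor column.
dat <- data.frame(
  age   = c(23, NA, 31, 45, 38),
  group = factor(c("a", "b", "a", "c", "b"))
)

# Option 1: drop every row that contains an NA.
dat_complete <- na.omit(dat)

# Option 2: replace NAs in the numeric column with that column's median.
dat_imputed <- dat
dat_imputed$age[is.na(dat_imputed$age)] <- median(dat$age, na.rm = TRUE)

# Expand the factor into one 0/1 column per level (no intercept),
# then bind it to the numeric column to get an all-numeric predictor matrix.
dummies <- model.matrix(~ group - 1, data = dat_imputed)
x <- cbind(age = dat_imputed$age, dummies)
```

The resulting x is a plain numeric matrix with no NAs and no factors, which is the kind of input you can hand to a feature selection routine without relying on each underlying package to treat factors and missing values the same way.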

Also, just in case you haven't (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points. http://cran.r-project.org/web/packages/caret/index.html
