R caret / rfe variable selection for factors() AND NAs

Posted by 不羁的心 on 2019-12-03 21:45:30

Question


I have a data set with NAs sprinkled generously throughout.

In addition it has columns that need to be factors().

I am using the rfe() function from the caret package to select variables.

It seems that with functions = lmFuncs, rfe() works on the data with NAs but NOT on the factor variables, while with functions = rfFuncs it works on the factor variables but NOT on the NAs.

Any suggestions for dealing with this?

I tried model.matrix() but it seems to just cause more problems.


Answer 1:


Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.

  • For NAs, either omit or impute (median, knn, etc.).
  • For factor features, you were on the right track with model.matrix(). It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
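
As a rough sketch of the two preprocessing steps above combined (not necessarily how you would want to treat your own data): by default model.matrix() drops every row that contains an NA, which may be part of the trouble you ran into. The toy data frame and the helper names impute_median and add_missing_level below are made up for illustration, and median imputation plus an explicit "missing" level are just one possible choice.

```r
# Hypothetical toy data: one numeric and one factor column, both with NAs
dat <- data.frame(
  num = c(1.0, NA, 3.0, 4.0, NA, 6.0),
  grp = factor(c("a", "b", NA, "a", "c", "b"))
)

# With the default na.action, rows containing NA are dropped
n_default <- nrow(model.matrix(~ . - 1, data = dat))

# Replace missing numeric values with the column median
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Turn missing factor values into an explicit "missing" level
add_missing_level <- function(f) {
  f <- as.character(f)
  f[is.na(f)] <- "missing"
  factor(f)
}

num_cols <- vapply(dat, is.numeric, logical(1))
fac_cols <- vapply(dat, is.factor, logical(1))
dat[num_cols] <- lapply(dat[num_cols], impute_median)
dat[fac_cols] <- lapply(dat[fac_cols], add_missing_level)

# Every row is now kept, and each factor level gets its own dummy column
X <- model.matrix(~ . - 1, data = dat)
```

Comparing n_default with nrow(X) shows how many rows the default behaviour would have discarded. Whether imputing is appropriate depends on why the values are missing in the first place.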

Also, just in case you haven't (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points. http://cran.r-project.org/web/packages/caret/index.html
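
Once the predictors are numeric and free of NAs, they can be handed to rfe(). Here is a minimal sketch on simulated data, assuming the caret package is installed; the column names, the response y, the sizes, and the 5-fold CV setting are all invented for the example and are not tuned recommendations.

```r
library(caret)

set.seed(1)
n <- 100

# Hypothetical predictors: three numeric columns and one factor
X <- data.frame(
  x1  = rnorm(n),
  x2  = rnorm(n),
  x3  = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)

# Simulated response that depends on x1 and x2 only
y <- 2 * X$x1 - X$x2 + rnorm(n)

# Treatment-coded dummies; drop the intercept column because lm() adds its own
X_num <- as.data.frame(model.matrix(~ ., data = X)[, -1])

# Recursive feature elimination with linear models and 5-fold cross-validation
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)
fit  <- rfe(x = X_num, y = y, sizes = 1:4, rfeControl = ctrl)

# Names of the variables retained in the selected subset
predictors(fit)
```

Note that after dummy coding, rfe() ranks each level's column separately, so a factor can end up only partially selected.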



Source: https://stackoverflow.com/questions/9145874/r-caret-rfe-variable-selection-for-factors-and-nas
