R decision tree using all the variables

Submitted by 我的梦境 on 2019-12-08 08:45:39

Question


I would like to perform a decision tree analysis, and I want the decision tree to use all the variables in the model.

I also need to plot the decision tree. How can I do that in R?

This is a sample of my dataset:

> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0
> 

I would like to use the formula

myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score

Note that all the variables are categorical.

EDIT: My problem is that some variables do not appear in the final decision tree. The depth of the tree should be controlled by a penalty parameter alpha, but I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
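Since all the predictors are categorical but stored as integers, they should be converted to factors before fitting; otherwise a tree will treat them as numeric. A minimal sketch, rebuilding the six sample rows printed by head(d) above (illustration only; the asthma3 and tres3 columns are omitted because the formula does not use them):

```r
# Rebuild the six sample rows shown by head(d) above (illustration only)
d <- data.frame(
  TargetGroup2000  = c(2, 2, 2, 2, 2, 2),
  TargetGroup2012  = c(2, 2, 2, 2, 3, 3),
  SmokingGroup_Kai = c(4, 4, 5, 4, 3, 3),
  PA_Score         = c(2, 3, 1, 2, 1, 2),
  wheeze3          = c(0, 1, 0, 1, 0, 0)
)

# Convert the integer-coded columns to factors so a tree treats them
# as categorical rather than numeric
cat_vars <- c("TargetGroup2000", "TargetGroup2012", "SmokingGroup_Kai", "PA_Score")
d[cat_vars] <- lapply(d[cat_vars], factor)
d$wheeze3 <- factor(d$wheeze3)  # factor outcome -> classification tree
```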


Answer 1:


As mentioned above, if you want the tree to consider all the variables, you should write it as

ctree(wheeze3 ~ ., d)

The penalty you mentioned is set via ctree_control(), where you can also adjust the p-value threshold and the minimum split and bucket sizes. So, to maximize the chance that all the variables are included, you can do something like this:

ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))

The problem is that you run the risk of overfitting.

The last thing you need to understand is that the reason you may not see all the variables in the tree's output is that they do not have a significant influence on the dependent variable. Unlike linear or logistic regression, which report every variable along with a p-value so you can judge its significance, a decision tree simply does not split on the insignificant variables.

For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
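Putting the answer above together as a runnable sketch, assuming the party package is installed (install.packages("party")); the toy data below is randomly generated and only mimics the shape of the question's d:

```r
library(party)

# Toy data mimicking the shape of d (values are random, illustration only)
set.seed(1)
n <- 200
d <- data.frame(
  TargetGroup2000  = factor(sample(1:3, n, replace = TRUE)),
  TargetGroup2012  = factor(sample(1:3, n, replace = TRUE)),
  SmokingGroup_Kai = factor(sample(1:5, n, replace = TRUE)),
  PA_Score         = factor(sample(1:3, n, replace = TRUE)),
  wheeze3          = factor(sample(0:1, n, replace = TRUE))
)

# mincriterion is 1 minus the p-value threshold: lowering it from the
# default 0.95 lets weaker splits through; minsplit/minbucket = 0 removes
# the node-size limits, so more variables have a chance to appear.
fit <- ctree(wheeze3 ~ ., data = d,
             controls = ctree_control(mincriterion = 0.85,
                                      minsplit = 0, minbucket = 0))
plot(fit)
```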




Answer 2:


The easiest way is to use the rpart package, which ships with core R.

library(rpart) 
model <- rpart( wheeze3 ~ ., data=d ) 

summary(model)
plot(model)
text(model)

The . in the formula argument means "use all the other variables as independent variables".
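To connect this to the penalty alpha from the question: in rpart the cost-complexity penalty is the cp parameter of rpart.control. A hedged sketch on randomly generated toy data (shape only): setting cp = 0 with tiny node-size limits grows the largest tree and minimizes training error, at the price of overfitting.

```r
library(rpart)

# Toy data mimicking the shape of d (values are random, illustration only)
set.seed(1)
n <- 200
d <- data.frame(
  TargetGroup2000  = factor(sample(1:3, n, replace = TRUE)),
  TargetGroup2012  = factor(sample(1:3, n, replace = TRUE)),
  SmokingGroup_Kai = factor(sample(1:5, n, replace = TRUE)),
  PA_Score         = factor(sample(1:3, n, replace = TRUE)),
  wheeze3          = factor(sample(0:1, n, replace = TRUE))
)

# cp is rpart's complexity penalty (the alpha of cost-complexity pruning);
# cp = 0 plus tiny minsplit/minbucket grows the full tree, minimizing
# training error at the cost of overfitting
model <- rpart(wheeze3 ~ ., data = d, method = "class",
               control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

plot(model)  # tree skeleton
text(model)  # split labels
```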




Answer 3:


plot(ctree(myFormula, data = d))


Source: https://stackoverflow.com/questions/22443554/r-decision-tree-using-all-the-variables
