How to specify split in a decision tree in R programming?


Question


I am trying to apply a decision tree here. A decision tree takes care of choosing the split at each node itself, but at the first node I want to split my tree on the basis of Age. How do I force that?

library(party)    
fit2 <- ctree(Churn ~ Gender + Age + LastTransaction + Payment.Method + spend + marStat, data = tsdata)

Answer 1:


There is no built-in option to do that in ctree(). The easiest way to do this "by hand" is simply:

  1. Learn a tree with only Age as the explanatory variable and maxdepth = 1 so that this creates only a single split.

  2. Split your data using the tree from step 1 and create a subtree for the left branch.

  3. Split your data using the tree from step 1 and create a subtree for the right branch.

This does what you want (although I typically wouldn't recommend doing so...).
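
With the question's data, these three steps could look roughly like the following (a minimal sketch, assuming the tsdata columns from the question and the partykit implementation of ctree() discussed below):

library("partykit")
## Step 1: a stump that can only split on Age
fit_age <- ctree(Churn ~ Age, data = tsdata, maxdepth = 1)
## Steps 2 and 3: one subtree per branch of the stump
node_id <- predict(fit_age, type = "node")
fit_left  <- ctree(Churn ~ Age + ., data = tsdata, subset = node_id == 2)
fit_right <- ctree(Churn ~ Age + ., data = tsdata, subset = node_id == 3)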

If you use the ctree() implementation from partykit (as in the sketch above), you can also merge these three trees back into a single tree for visualization, prediction, etc. It requires a bit of hacking but is feasible.

I will illustrate this using the iris data and will force a split on the variable Sepal.Length, which otherwise wouldn't be used in the tree. Learning the three trees above is easy:

library("partykit")
data("iris", package = "datasets")
tr1 <- ctree(Species ~ Sepal.Length,     data = iris, maxdepth = 1)
tr2 <- ctree(Species ~ Sepal.Length + ., data = iris,
  subset = predict(tr1, type = "node") == 2)
tr3 <- ctree(Species ~ Sepal.Length + ., data = iris,
  subset = predict(tr1, type = "node") == 3)

Note, however, that it is important to use the formula with Sepal.Length + . to ensure that the variables in the model frame are ordered in exactly the same way in all trees.
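
One way to check that ordering (it also explains the V2/V4/V5 labels in the node printout further below) is to inspect the model frame directly:

## Response first, then Sepal.Length, then the remaining predictors
names(model.frame(Species ~ Sepal.Length + ., data = iris))
## [1] "Species"      "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"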

Next comes the most technical step: we need to extract the raw node structure from all three trees, fix up the node ids so that they are in a proper sequence, and then integrate everything into a single node:

fixids <- function(x, startid = 1L) {
  id <- startid - 1L
  ## Recursively rebuild the node, drawing consecutive ids from a
  ## counter in the enclosing environment
  new_node <- function(x) {
    id <<- id + 1L
    if (is.terminal(x)) return(partynode(id, info = info_node(x)))
    partynode(id,
      split = split_node(x),
      kids = lapply(kids_node(x), new_node),
      surrogates = surrogates_node(x),
      info = info_node(x))
  }

  return(new_node(x))
}
## Graft the re-numbered subtrees onto the two kids of the tr1 root
no <- node_party(tr1)
no$kids <- list(
  fixids(node_party(tr2), startid = 2L),
  fixids(node_party(tr3), startid = 5L)
)
no
## [1] root
## |   [2] V2 <= 5.4
## |   |   [3] V4 <= 1.9 *
## |   |   [4] V4 > 1.9 *
## |   [5] V2 > 5.4
## |   |   [6] V4 <= 4.7
## |   |   |   [7] V4 <= 3.6 *
## |   |   |   [8] V4 > 3.6 *
## |   |   [9] V4 > 4.7
## |   |   |   [10] V5 <= 1.7 *
## |   |   |   [11] V5 > 1.7 *

And finally we set up a joint model frame containing all data and combine it with the new joint tree. Some information on fitted nodes and the response is added so that the tree can be turned into a constparty for nice visualization and predictions. See vignette("partykit", package = "partykit") for the background on this:

d <- model.frame(Species ~ Sepal.Length + ., data = iris)
tr <- party(no, 
  data = d,
  fitted = data.frame(
    "(fitted)" = fitted_node(no, data = d),
    "(response)" = model.response(d),
    check.names = FALSE),
  terms = terms(d))
tr <- as.constparty(tr)

And then we're done and can visualize our combined tree with the forced first split:

plot(tr)
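
The combined constparty supports the usual methods beyond plotting; for example, a quick prediction check (a minimal sketch):

## Predicted species for one observation from each class
predict(tr, newdata = iris[c(1, 51, 101), ])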




Answer 2:


At every split, a decision tree chooses the best available variable (based on information gain or the Gini index for CART, or on conditional inference tests for ctree). If you have another predictor variable that separates the classes better than Age does, that variable will be chosen first.

Based on your requirement, I think you can do one of the following:

(1) Unsupervised: discretize the Age variable (create bins, e.g. 0-20, 20-40, 40-60, according to your domain knowledge), subset the data by age bin, and then train a separate decision tree on each of these segments.
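
A minimal sketch of option (1), assuming the tsdata columns from the question and purely illustrative bin boundaries:

## Bin Age, then fit one tree per bin on the remaining predictors
tsdata$AgeBin <- cut(tsdata$Age, breaks = c(0, 20, 40, 60, Inf))
fits <- lapply(split(tsdata, tsdata$AgeBin), function(d)
  ctree(Churn ~ Gender + LastTransaction + Payment.Method + spend + marStat,
    data = d))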

(2) Supervised: keep dropping the other predictor variables until Age is chosen first. You will then get a decision tree in which Age is the first split variable. Use the rule the tree creates for Age (e.g., Age <= 36 vs. Age > 36) to subset the data into two parts, and on each part learn a full decision tree with all the variables.
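
For option (2), once the tree has produced a cut point for Age (36 here is purely illustrative), the subsetting is straightforward:

## Split on the rule found for Age, then grow a full tree on each part
part1 <- subset(tsdata, Age <= 36)
part2 <- subset(tsdata, Age > 36)
fit_part1 <- ctree(Churn ~ ., data = part1)
fit_part2 <- ctree(Churn ~ ., data = part2)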

(3) Supervised ensemble: you can use a random forest classifier to see how important your Age variable actually is.
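
A minimal sketch of option (3) with the randomForest package (assuming Churn is a factor, so that a classification forest is grown):

library(randomForest)
set.seed(42)    # for reproducibility
rf <- randomForest(Churn ~ ., data = tsdata, importance = TRUE)
importance(rf)  # per-variable importance measures, including Age
varImpPlot(rf)  # visual ranking of the variables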




Answer 3:


You can use rpart and partykit in combination to achieve this.

Notice that if you use ctree to train the decision tree and then use the data_party function to extract data from the different nodes, the extracted data set will contain only the training variables (in your case, only Age).

We have to use rpart for the first step, training the model on the selected variable, because rpart lets you keep all your variables in the extracted data set without using those variables for training:

library(rpart)
fit2 <- rpart(Churn ~ . - (Gender + LastTransaction + Payment.Method + spend + marStat),
  data = tsdata, maxdepth = 1)

Using this method, your only training variable is Age, and you can convert your rpart tree to a partykit tree, extract the data from the different nodes, and train on them separately:

library(partykit)
fit2party <- as.party(fit2)
dataset1 <- data_party(fit2party, id = 2)
dataset2 <- data_party(fit2party, id = 3)

Now you have two data sets split on Age, each containing all the variables you want to use for training future trees; you can build decision trees on those subsets however you see fit, using rpart or ctree.
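
For example (a minimal sketch, assuming, as described above, that the extracted data sets contain Churn together with all the other predictors):

## Grow a full tree on each Age-based subset
fit_node2 <- ctree(Churn ~ ., data = dataset1)
fit_node3 <- ctree(Churn ~ ., data = dataset2)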

Later you can use the partynode and partysplit combination to construct a tree by hand from the splitting rules you obtained.
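
A minimal sketch of that last step (the varid and the threshold of 36 are purely illustrative; varid indexes the columns of the data passed to party()):

## Hand-built stump: split column 2 of the data (assumed to be Age) at 36,
## with two terminal kids
sp <- partysplit(varid = 2L, breaks = 36)
nd <- partynode(1L, split = sp,
  kids = list(partynode(2L), partynode(3L)))
stump <- party(nd, data = tsdata)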

Hope this is what you are looking for.



Source: https://stackoverflow.com/questions/39844830/how-to-specify-split-in-a-decision-tree-in-r-programming
