Decision trees. Choosing thresholds to split objects

只愿长相守 submitted on 2020-03-21 05:15:10

Question


If I understand this correctly, we are given a set of objects (each an array of features) and need to split it into 2 subsets. To do that, we compare some feature xj to a threshold tm (tm is the threshold at node m). We use an impurity function H() to find the best way to split the objects. But how do we choose the value of tm, and which feature should be compared to the threshold? There are infinitely many possible values for tm, so we can't just compute H() for every one of them.


Answer 1:


On page 18 of these slides, two methods are introduced for choosing the splitting threshold for a numerical attribute X.

Method 1:

  • Sort data according to X into {x_1, ..., x_m}
  • Consider split points of the form x_i + (x_{i+1} - x_i)/2, i.e. the midpoint between each pair of consecutive sorted values
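
As a rough illustration of Method 1, here is a minimal Python sketch. The function name `candidate_thresholds` is made up for this example, and dropping duplicate values before taking midpoints is an assumption on my part (the slides only say to sort the data).

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values.

    Hypothetical helper: duplicates are removed first (an assumption),
    so that no threshold coincides with an observed value.
    """
    xs = sorted(set(values))
    # Same form as in the slides: x_i + (x_{i+1} - x_i) / 2
    return [lo + (hi - lo) / 2 for lo, hi in zip(xs, xs[1:])]


if __name__ == "__main__":
    print(candidate_thresholds([3.0, 1.0, 2.0, 2.0]))
```

With m distinct values this yields at most m - 1 candidates, so the search over thresholds is finite.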

Method 2:

Suppose X is a real-valued variable

  • Define IG(Y|X:t) as H(Y) - H(Y|X:t)

  • Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)

    • IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t
  • Then define IG^*(Y|X) = max_t IG(Y|X:t)

  • For each real-valued attribute, use IG^*(Y|X) to assess its suitability as a split
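
The definitions in Method 2 could be sketched in Python roughly as follows. The function names are hypothetical, H is taken to be Shannon entropy in bits, probabilities are estimated as empirical frequencies, and the maximum over t is restricted to midpoints between consecutive distinct values of X. All of these are assumptions made for the sketch, not something the slides prescribe.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) in bits, estimated from label frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    """IG(Y|X:t) = H(Y) - H(Y|X:t) for a single threshold t."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    # H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
    conditional = entropy(left) * len(left) / n + entropy(right) * len(right) / n
    return entropy(ys) - conditional


def best_info_gain(xs, ys):
    """IG^*(Y|X) = max_t IG(Y|X:t); returns (gain, t), or (0.0, None) if no split exists."""
    vals = sorted(set(xs))
    thresholds = [a + (b - a) / 2 for a, b in zip(vals, vals[1:])]
    if not thresholds:
        return (0.0, None)
    return max((info_gain(xs, ys, t), t) for t in thresholds)


if __name__ == "__main__":
    print(best_info_gain([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))
```

Calling `best_info_gain` once per real-valued attribute and comparing the returned gains is one way to rank attributes as described in the last bullet.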

Note that a tree may split on the same attribute multiple times, with different thresholds.




Answer 2:


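There isn't really an infinite number of ways of choosing tm. The training set is finite, so any two thresholds that fall between the same pair of adjacent observed values of a feature produce exactly the same split. Given a reasonable set of candidate thresholds, a simple implementation might iterate over them for each feature, evaluate H() for the resulting split, and choose the feature and threshold that give the best split under that impurity measure for that node of the decision tree.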
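
A simple implementation along those lines might look like the sketch below. It assumes Gini impurity for H() and uses midpoints between consecutive distinct values as the candidate thresholds; both are choices made for this example, and `gini` and `best_split` are hypothetical names.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (used here as H())."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every feature and every candidate threshold.

    Returns (weighted_impurity, feature_index, threshold) for the split
    with the lowest weighted impurity, or None if no split is possible.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        vals = sorted({row[j] for row in rows})
        for lo, hi in zip(vals, vals[1:]):
            t = lo + (hi - lo) / 2  # assumed candidate: midpoint
            left = [y for row, y in zip(rows, labels) if row[j] < t]
            right = [y for row, y in zip(rows, labels) if row[j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


if __name__ == "__main__":
    rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
    labels = ["a", "a", "b", "b"]
    print(best_split(rows, labels))
```

This brute-force version recomputes both subsets for every candidate, which is fine for small data; real implementations typically sort each feature once and update class counts incrementally.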



Source: https://stackoverflow.com/questions/45513511/decision-trees-choosing-thresholds-to-split-objects
