How to calculate the threshold value for numeric attributes in Quinlan's C4.5 algorithm?

Submitted by 纵饮孤独 on 2019-11-30 14:56:35

As your generated tree image implies, attributes are considered in order, one level at a time. Your example with the threshold of 75 belongs to the outlook = sunny branch. If you filter your data by outlook = sunny, you get the following table.

outlook temperature humidity    windy   play
sunny   69           70         FALSE   yes
sunny   75           70         TRUE    yes
sunny   85           85         FALSE   no
sunny   80           90         TRUE    no
sunny   72           95         FALSE   no

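As you can see, within this subset a humidity threshold of 75 separates the classes completely: both rows with humidity 70 are "yes", and the rows with humidity 85, 90 and 95 are all "no".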
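To check this against the table, here is a minimal Python sketch that splits the sunny rows on humidity. The helper name split_by_humidity is made up for illustration, and treating the test as "humidity <= 75" versus "humidity > 75" is an assumption about how the tree node is written.

```python
# Rows of the outlook = sunny subset, copied from the table above:
# (temperature, humidity, windy, play)
sunny_rows = [
    (69, 70, False, "yes"),
    (75, 70, True, "yes"),
    (85, 85, False, "no"),
    (80, 90, True, "no"),
    (72, 95, False, "no"),
]


def split_by_humidity(rows, threshold):
    """Return the play labels on each side of a humidity threshold.

    Hypothetical helper; assumes the test is `humidity <= threshold`.
    """
    low = [play for _, humidity, _, play in rows if humidity <= threshold]
    high = [play for _, humidity, _, play in rows if humidity > threshold]
    return low, high


low, high = split_by_humidity(sunny_rows, 75)
print("humidity <= 75:", low)
print("humidity >  75:", high)
```

Each side contains a single class, so no further split is needed under this branch.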

J48 is Weka's implementation of C4.5, which is the successor to the ID3 algorithm. It uses entropy-based measures (information gain, and in C4.5 the gain ratio) to decide the best split. According to Wikipedia:

"The attribute with the smallest entropy is used to split the set on this iteration. The higher the entropy, the higher the potential to improve the classification here."

I'm not entirely sure about J48, but assuming it is based on C4.5, it would evaluate every possible split point, i.e. every boundary between adjacent distinct values of the feature once they are sorted. For each candidate split it computes the information gain and then chooses the split with the highest gain. In the case of the humidity values {70,85,90,95} it would compute the information gain for {70|85,90,95} vs {70,85|90,95} vs {70,85,90|95} and choose the best one.
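The comparison described above can be sketched in a few lines of Python for the sunny subset. This is only an illustration under stated assumptions, not J48's actual code: the function names are hypothetical, plain information gain is used as the score, and each candidate threshold is taken as the midpoint between adjacent distinct values. The real implementation may report a different value from the same gap (such as 75).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(pairs, threshold):
    """Gain from splitting (value, label) pairs at `value <= threshold`."""
    labels = [label for _, label in pairs]
    left = [label for value, label in pairs if value <= threshold]
    right = [label for value, label in pairs if value > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part) for part in (left, right))
    return entropy(labels) - remainder


def candidate_thresholds(pairs):
    """Midpoints between adjacent distinct values (an assumption of this sketch)."""
    values = sorted({value for value, _ in pairs})
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


# (humidity, play) for the outlook = sunny rows in the table above
humidity = [(70, "yes"), (70, "yes"), (85, "no"), (90, "no"), (95, "no")]

gains = {t: information_gain(humidity, t) for t in candidate_thresholds(humidity)}
best = max(gains, key=gains.get)

for t, g in gains.items():
    print(f"humidity <= {t}: gain = {g:.3f}")
print("best threshold:", best)
```

The split {70|85,90,95} wins because it leaves both sides pure, so its gain equals the full entropy of the subset.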

Quinlan's book on C4.5 is a good starting point (https://goo.gl/J2SsPf). See page 25 in particular.
