I am trying to find out how the C4.5 algorithm determines the threshold value for numeric attributes. I have researched this and cannot understand it; in most places I've found this i
As your generated tree image implies, the attributes are considered in order. Your 75 example belongs to the outlook = sunny branch. If you filter your data by outlook = sunny, you get the following table.
outlook temperature humidity windy play
sunny 69 70 FALSE yes
sunny 75 70 TRUE yes
sunny 85 85 FALSE no
sunny 80 90 TRUE no
sunny 72 95 FALSE no
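As you can see, the threshold for humidity is "< 75" under this condition: both rows with humidity 70 are "yes", while the rows with humidity 85, 90, and 95 are "no", so a cut anywhere between 70 and 85 separates the two classes.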
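To make the threshold search concrete, here is a minimal Python sketch of how a numeric split can be chosen by information gain on the sunny subset above. It assumes the textbook approach of trying the midpoint between each pair of adjacent distinct values; the function names `entropy` and `best_threshold` are made up for this illustration and are not taken from C4.5 or Weka. Real implementations may report a value that occurs in the training data instead of the midpoint, which would explain a reported 75 rather than 77.5; both lie in the same gap between 70 and 85.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (cut, gain) for the midpoint cut with the highest information gain.

    Hypothetical helper: candidate cuts are midpoints between adjacent
    distinct sorted values, which is an assumption of this sketch.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut is possible between equal values
        cut = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Humidity and play columns of the outlook = sunny rows, in table order.
humidity = [70, 70, 85, 90, 95]
play = ["yes", "yes", "no", "no", "no"]

cut, gain = best_threshold(humidity, play)
print(cut, round(gain, 3))  # 77.5 0.971
```

The gain equals the full entropy of the subset (about 0.971 bits) because the cut leaves both sides pure.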
J48 is Weka's implementation of C4.5, which is the successor to the ID3 algorithm. It uses entropy and information gain to decide the best split. According to Wikipedia:
"The attribute with the smallest entropy is used to split the set on this iteration.
The higher the entropy, the higher the potential to improve the classification here."
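As a rough illustration of "smallest entropy wins", the sketch below compares two candidate splits on the outlook = sunny rows from the first answer: the categorical attribute windy, and humidity cut at 75. The helper names `entropy` and `weighted_entropy` are hypothetical, and the code only scores splits by the weighted entropy of the resulting groups; it is not a reproduction of what J48 does internally.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def weighted_entropy(keys, labels):
    """Entropy left after grouping the labels by the given split keys."""
    groups = defaultdict(list)
    for key, label in zip(keys, labels):
        groups[key].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


# (temperature, humidity, windy, play) for the outlook = sunny rows.
rows = [
    (69, 70, False, "yes"),
    (75, 70, True, "yes"),
    (85, 85, False, "no"),
    (80, 90, True, "no"),
    (72, 95, False, "no"),
]
play = [r[3] for r in rows]

candidates = {
    "windy": [r[2] for r in rows],
    "humidity <= 75": [r[1] <= 75 for r in rows],
}
scores = {name: weighted_entropy(keys, play) for name, keys in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best split:", best)
```

Splitting on windy leaves about 0.951 bits of entropy, while the humidity cut leaves 0, so the humidity split is the one selected here.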