ID3 and C4.5: How Does “Gain Ratio” Normalize “Gain”?
Question: The ID3 algorithm uses the "Information Gain" measure. C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split in which records are divided evenly between the outcomes and low otherwise. My question is: how does this help solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the
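
To make the quantities in the question concrete, here is a minimal Python sketch of Information Gain, SplitInfo and Gain Ratio as they are usually defined (entropy in bits). The function names and the 8-record toy data are my own assumptions for illustration, not taken from any ID3 or C4.5 implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_info(partition_sizes):
    """SplitInfo: entropy of the partition sizes themselves, ignoring class labels."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)


def information_gain(labels, partitions):
    """Entropy of the parent minus the size-weighted entropy of the partitions."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder


def gain_ratio(labels, partitions):
    """Information Gain divided by SplitInfo (0.0 if SplitInfo is zero)."""
    si = split_info([len(p) for p in partitions])
    return information_gain(labels, partitions) / si if si > 0 else 0.0


# Hypothetical toy data: 8 records, 4 "yes" and 4 "no".
labels = ["yes"] * 4 + ["no"] * 4

# A clean 2-way split versus an ID-like 8-way split (one record per branch).
two_way = [["yes"] * 4, ["no"] * 4]
eight_way = [[label] for label in labels]

for name, parts in [("2-way", two_way), ("8-way", eight_way)]:
    print(
        name,
        "gain =", information_gain(labels, parts),
        "split_info =", split_info([len(p) for p in parts]),
        "gain_ratio =", gain_ratio(labels, parts),
    )
```

In this toy case both splits reach the same Information Gain (1 bit), but an even split into k branches has SplitInfo equal to log2(k), so the 8-way split is divided by 3 while the 2-way split is divided by 1. The distribution of records across branches therefore does reflect the number of outcomes, at least when the split is roughly even.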