Binary decision tree model when the proportion of one of the labels is almost null

Question

I want to build a decision tree that predicts one of two options: "YES" or "NO". The dataset I am working with has 99% "YES" answers and only 1% "NO" answers. When I ran the model, it scored up to 97% accuracy.

Is this a valid model, or are there considerations to take into account when working with such unbalanced proportions?

I am afraid that, because of the large amount of "YES" data, the model is very accurate simply by answering "YES" to everything. The "NO"s are very important to this use case; in fact, they are what we want to identify.

Answer 1:

No. Your benchmark has to be 99%, because a model that predicts from a basic average (which amounts to always predicting "YES") already reaches 99% accuracy, so a score of 97% is below that baseline. Cases like this are better evaluated with the ROC curve or AUROC than with accuracy. As a rule of thumb, when working with extremely unbalanced data, benchmark against the proportion of the data that belongs to the dominant class.
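To make the baseline argument concrete, here is a minimal pure-Python sketch. It is not the asker's model or data: the labels are synthetic (990 "YES", 10 "NO") to match the stated proportions, the "hypothetical" predictions are invented numbers for illustration only, and the helper names `accuracy`, `recall` and `auroc` are my own. In practice a library such as scikit-learn would likely be used for these metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of rows whose true label is `label` that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total if total else 0.0


def auroc(y_true, scores, positive):
    """Rank-based AUROC: chance that a random positive row outscores a
    random negative row, with ties counted as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels with the proportions from the question (assumption).
y_true = ["YES"] * 990 + ["NO"] * 10

# Baseline: always predict the majority class.
majority = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority] * len(y_true)
baseline_acc = accuracy(y_true, baseline_pred)
baseline_no_recall = recall(y_true, baseline_pred, "NO")
# A constant score carries no ranking information about "NO".
baseline_auroc = auroc(y_true, [0.0] * len(y_true), positive="NO")

# Hypothetical model (invented numbers): 28 false alarms among the "YES"
# rows, and 8 of the 10 "NO" rows caught.
hypo_pred = ["NO"] * 28 + ["YES"] * 962 + ["NO"] * 8 + ["YES"] * 2
hypo_acc = accuracy(y_true, hypo_pred)
hypo_no_recall = recall(y_true, hypo_pred, "NO")

print(f"baseline:     accuracy={baseline_acc:.2f}  NO recall={baseline_no_recall:.2f}  AUROC={baseline_auroc:.2f}")
print(f"hypothetical: accuracy={hypo_acc:.2f}  NO recall={hypo_no_recall:.2f}")
```

On this synthetic data the always-"YES" baseline reaches 99% accuracy while finding none of the "NO" rows, and its constant scores give an AUROC of 0.5. The invented 97% model scores lower on accuracy yet catches 8 of the 10 "NO" rows, which is why accuracy alone cannot tell you whether a 97% tree is useful; check the recall on "NO" or the AUROC of the actual model instead.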

Source: https://stackoverflow.com/questions/57700105/binary-decision-tree-model-when-the-proportion-of-one-of-the-labels-is-almost-nu

Tags

machine-learning

decision-tree
