Question
I want to build a decision tree that predicts one of two labels: "YES" or "NO". The dataset I am working with is 99% "YES" answers and only 1% "NO" answers. When I ran the model, it reached up to 97% accuracy.
Is this a valid model, or are there considerations to take into account when working with such unbalanced class proportions?
I am afraid that, because of the large amount of "YES" data, the model is very accurate simply because it answers "YES" to everything. The "NO"s are very important for this use case; in fact, they are what we want to identify.
Answer 1:
No. Your benchmark has to be 99%, because a model that simply predicts the most common label (that is, always predicts "YES") already achieves 99% accuracy, and your 97% falls below that. Cases like this are better evaluated with the ROC curve or the area under it (AUROC) than with accuracy. When working with extremely unbalanced data, a common rule of thumb is to benchmark against the proportion of the data that belongs to the dominant class.
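To make the baseline argument concrete, here is a minimal sketch in plain Python. The 1,000-row toy dataset, the helper names (`accuracy`, `recall`, `auroc`) and the hypothetical model predictions are all assumptions made up for illustration; they are not taken from the asker's data or model.

```python
# Toy labels mirroring the 99% / 1% split in the question (made-up data).
y_true = ["YES"] * 990 + ["NO"] * 10

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Of the rows whose true label is `label`, the fraction predicted as `label`."""
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds) / len(preds)

def auroc(y_true, scores, positive):
    """Chance that a random positive row outscores a random negative row (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Baseline: always predict the dominant class.
always_yes = ["YES"] * len(y_true)
baseline_accuracy = accuracy(y_true, always_yes)
baseline_no_recall = recall(y_true, always_yes, "NO")
# A constant score carries no ranking information about which rows are "NO".
baseline_auroc = auroc(y_true, [0.0] * len(y_true), "NO")

# Hypothetical model: flags 26 true "YES" rows as "NO" but catches 6 of the 10 "NO" rows.
model_pred = ["YES"] * 964 + ["NO"] * 26 + ["NO"] * 6 + ["YES"] * 4
model_accuracy = accuracy(y_true, model_pred)
model_no_recall = recall(y_true, model_pred, "NO")

print("always-YES accuracy:", baseline_accuracy)
print("always-YES recall on NO:", baseline_no_recall)
print("always-YES AUROC:", baseline_auroc)
print("hypothetical model accuracy:", model_accuracy)
print("hypothetical model recall on NO:", model_no_recall)
```

On this toy data the always-"YES" baseline scores 99% accuracy while finding none of the "NO" rows, which is why accuracy alone says little here. Per-class recall or AUROC shows whether a model is actually separating the rare class.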
Source: https://stackoverflow.com/questions/57700105/binary-decision-tree-model-when-the-proportion-of-one-of-the-labels-is-almost-nu