I have a dataset where the classes are unbalanced. The classes are either \'1\' or \'0\' where the ratio of class \'1\':\'0\' is 5:1. How do you calculate the prediction e
This is really a shame that sklearn's "fit" method does not allow specifying a performance measure to be optimized. No one around seem to understand or question or be interested in what's actually going on when one calls fit method on data sample when solving a classification task.
We (users of the scikit learn package) are silently left with suggestion to indirectly use crossvalidated grid search with specific scoring method suitable for unbalanced datasets in hope to stumble upon a parameters/metaparameters set which produces appropriate AUC or F1 score.
But think about it: looks like "fit" method called under the hood each time always optimizes accuracy. So in end effect, if we aim to maximize F1 score, GridSearchCV gives us "model with best F1 from all modesl with best accuracy". Is that not silly? Would not it be better to directly optimize model's parameters for maximal F1 score? Remember old good Matlab ANNs package, where you can set desired performance metric to RMSE, MAE, and whatever you want given that gradient calculating algo is defined. Why is choosing of performance metric silently omitted from sklearn?
At least, why there is no simple option to assign class instances weights automatically to remedy unbalanced datasets issues? Why do we have to calculate wights manually? Besides, in many machine learning books/articles I saw authors praising sklearn's manual as awesome if not the best sources of information on topic. No, really? Why is unbalanced datasets problem (which is obviously of utter importance to data scientists) not even covered nowhere in the docs then? I address these questions to contributors of sklearn, should they read this. Or anyone knowing reasons for doing that welcome to comment and clear things out.
UPDATE
Since scikit-learn 0.17, there is class_weight='balanced' option which you can pass at least to some classifiers:
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).