How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

渐次进展 2020-12-04 04:58

I'm working on a sentiment analysis problem; the data looks like this:

label  instances
    5       1190
    4        838
    3        239
    1        204
    2        127


        
4 Answers
  •  [愿得一人]
    2020-12-04 05:22

    First of all, it is a little hard to tell from counting alone whether your data is unbalanced or not. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know.
    So it is always better to use all your available knowledge and judge its status wisely.

    Okay, what if it really is unbalanced?
    Once again, look at your data. Sometimes you can find one or two observations duplicated hundreds of times. Sometimes it is useful to create such synthetic one-class observations (oversampling).
    If all the data is clean, the next step is to use class weights in the prediction model.
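    The class-weight idea can be sketched like this; the synthetic data, the label proportions (which loosely mimic the counts in the question) and the choice of estimator are assumptions for illustration only:

```python
# Class weights in scikit-learn: a minimal sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced 5-class toy data, roughly mimicking the label counts above.
X, y = make_classification(
    n_samples=2000, n_classes=5, n_informative=8,
    weights=[0.46, 0.32, 0.09, 0.08, 0.05], random_state=0,
)

# class_weight='balanced' re-weights each class inversely to its frequency,
# so rare classes contribute as much to the loss as frequent ones.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)
preds = clf.predict(X)
```

    Any classifier that accepts a class_weight parameter (SVC, random forest, etc.) can be used the same way.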

    So what about multiclass metrics?
    In my experience, none of those metrics is usually used on its own. There are two main reasons.
    First: it is always better to work with probabilities than with hard predictions (otherwise, how could you separate models with 0.9 and 0.6 confidence if they both assign the same class?).
    Second: it is much easier to compare your prediction models and build new ones when you rely on a single good metric.
    From my experience I would recommend logloss or MSE (mean squared error).
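    A small sketch of scoring on probabilities; the labels and probability rows below are invented for the demo, and the MSE here is computed between one-hot labels and probabilities (a multiclass Brier-style score):

```python
# Scoring a multiclass model on probabilities rather than hard labels.
import numpy as np
from sklearn.metrics import log_loss
from sklearn.preprocessing import label_binarize

classes = [1, 2, 3, 4, 5]
y_true = [1, 2, 3, 4, 5, 5, 4, 1]

# One row per sample: predicted probability of classes 1..5 (rows sum to 1).
proba = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.50, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.60, 0.20],
    [0.05, 0.05, 0.10, 0.20, 0.60],
    [0.10, 0.10, 0.10, 0.20, 0.50],
    [0.10, 0.10, 0.20, 0.50, 0.10],
    [0.60, 0.20, 0.10, 0.05, 0.05],
])

ll = log_loss(y_true, proba, labels=classes)

# MSE between one-hot labels and the probability matrix.
mse = np.mean((label_binarize(y_true, classes=classes) - proba) ** 2)
```

    Both numbers go down as the model gets better, and both see the difference between a confident and a hesitant correct prediction.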

    How to fix sklearn warnings?
    Simply (as yangjie noticed) override the average parameter with one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label, then take their unweighted mean) or 'weighted' (the same as macro, but weighted by each label's support).

    from sklearn.metrics import f1_score

    f1_score(y_test, prediction, average='weighted')
    

    All your warnings came from calling the metrics functions with the default average value 'binary', which is inappropriate for multiclass prediction.
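    Putting it together for the question in the title, with invented labels: precision, recall, accuracy and F1 for a multiclass problem (note that accuracy takes no average argument):

```python
# Precision, recall, accuracy and F1 for a multiclass problem.
# y_test and prediction are made-up labels for the demonstration.
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
)

y_test = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
prediction = [1, 2, 3, 4, 5, 4, 4, 3, 3, 1]

acc = accuracy_score(y_test, prediction)  # accuracy needs no average argument
prec = precision_score(y_test, prediction, average="weighted")
rec = recall_score(y_test, prediction, average="weighted")
f1 = f1_score(y_test, prediction, average="weighted")
```

    Swapping 'weighted' for 'micro' or 'macro' in the three calls gives the other averaging behaviours described above.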
    Good luck and have fun with machine learning!

    Edit:
    I found another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember, there is no such thing as multiclass regression. Yes, there is multi-output regression, which is quite different, and yes, in some cases it is possible to switch between regression and classification (if the classes are somehow ordered), but that is pretty rare.

    What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.

    After that you can take the arithmetic or geometric mean of the predictions, and most of the time you will get an even better result.

    final_prediction = (KNNprediction * RFprediction) ** 0.5
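    A runnable sketch of that blending idea; the toy data and the two classifiers are assumptions, and the geometric mean is taken over predict_proba outputs rather than hard labels:

```python
# Blending two classifiers by a geometric mean of their predicted
# probabilities; the toy data and model choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)

KNNprediction = KNeighborsClassifier().fit(X, y).predict_proba(X)
RFprediction = RandomForestClassifier(random_state=0).fit(X, y).predict_proba(X)

# Element-wise geometric mean of the two probability matrices; argmax is
# unaffected by renormalisation, so rows need not be rescaled to sum to 1.
blended = (KNNprediction * RFprediction) ** 0.5
final_prediction = blended.argmax(axis=1)
```

    In real use you would of course blend predictions on a held-out set, not on the training data as in this sketch.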
    
