roc_auc_score - Only one class present in y_true

忘了有多久 2020-12-16 14:28

I am doing k-fold cross-validation on an existing dataframe, and I need to get the AUC score. The problem is that sometimes the test data contains only 0s and no 1s!

I tried usin

5 Answers
  • 2020-12-16 14:44

    You could use try-except to prevent the error:

    import numpy as np
    from sklearn.metrics import roc_auc_score
    y_true = np.array([0, 0, 0, 0])
    y_scores = np.array([1, 0, 0, 0])
    try:
        auc = roc_auc_score(y_true, y_scores)
    except ValueError:
        # Only one class is present in y_true, so ROC AUC is undefined.
        auc = None
    

    You could also set the score to zero when only one class is present, but I wouldn't do this. I guess your test data is highly unbalanced. I would suggest using stratified K-fold instead, so that you at least have both classes present in every fold.
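
A minimal sketch of the stratified K-fold suggestion, assuming a synthetic imbalanced dataset and a `LogisticRegression` model (both are illustrative choices, not taken from the question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced data: 10 positives out of 100 rows (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)
X[y == 1] += 1.0  # shift the positives so the classes are separable to some degree

# StratifiedKFold preserves the class ratio, so each test fold gets 2 positives.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print(fold_aucs)
```

Stratification only helps if the minority class has at least as many samples as there are folds; with fewer positives than folds, some fold will still lack a class.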

  • 2020-12-16 14:48

    As the error notes, the ROC AUC score is not defined when a class is missing from the ground truth of a batch.

    I'm against either throwing an exception (about what? This is the expected behaviour) or returning another metric (e.g. accuracy). The metric is not broken per se.

    I don't feel like solving a data imbalance "issue" with a metric "fix". It would probably be better to use a different sampling scheme, if possible, or to join multiple batches so that together they satisfy the class population requirement.
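
One way to read "join multiple batches" is to pool the out-of-fold predictions and compute a single AUC over all of them. The sketch below assumes synthetic data arranged so that three of the five test folds contain no positives; it also assumes that scores from the per-fold models are comparable, which is not guaranteed in general.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic data (assumption): positives sit only in the first two blocks of 20 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.zeros(100, dtype=int)
y[[0, 1, 2, 3, 4, 20, 21, 22, 23, 24]] = 1
X[y == 1] += 1.0

pooled_true, pooled_scores = [], []
single_class_folds = 0
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    pooled_scores.append(model.predict_proba(X[test_idx])[:, 1])
    pooled_true.append(y[test_idx])
    # Count folds where a per-fold AUC would be undefined.
    if np.unique(y[test_idx]).size < 2:
        single_class_folds += 1

# One AUC over the pooled out-of-fold predictions.
pooled_auc = roc_auc_score(np.concatenate(pooled_true), np.concatenate(pooled_scores))
print(single_class_folds, pooled_auc)
```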

  • 2020-12-16 14:59

    I am facing the same problem now, and using try-except does not solve my issue. I developed the code below to deal with it.

    import numpy as np
    import pandas as pd

    class KFold(object):
        """K-fold splitter that puts both classes of a binary target in every fold."""

        def __init__(self, folds, random_state=None):
            self.folds = folds
            self.random_state = random_state

        def split(self, x, y):
            assert len(x) == len(y), 'x and y should have the same length'

            # Shuffle the labels once; the index keeps the original row positions.
            y_ = pd.Series(np.asarray(y).ravel()).sample(frac=1, random_state=self.random_state)

            # Filter on the shuffled series itself so the mask and the index stay aligned.
            event_index = y_[y_ == 1].index.to_numpy()
            non_event_index = y_[y_ == 0].index.to_numpy()

            assert len(event_index) >= self.folds, 'number of folds should not exceed the number of events (y == 1)'
            assert len(non_event_index) >= self.folds, 'number of folds should not exceed the number of non-events (y == 0)'

            # Cut each class into exactly `folds` chunks.
            event_chunks = np.array_split(event_index, self.folds)
            non_event_chunks = np.array_split(non_event_index, self.folds)

            all_index = np.arange(len(y_))
            indexes = []
            for event_chunk, non_event_chunk in zip(event_chunks, non_event_chunks):
                # The validation fold is one chunk of each class;
                # the training fold is every remaining row, as in standard k-fold.
                valid_fold = np.concatenate([event_chunk, non_event_chunk])
                train_fold = np.setdiff1d(all_index, valid_fold)
                indexes.append((list(train_fold), list(valid_fold)))

            # Each entry is (train positions, validation positions), 0-based.
            return indexes
    

    I just wrote this code and have not tested it exhaustively; it was tested only for binary categories. I hope it is useful anyway.

  • 2020-12-16 15:00

    Simply changing one of the 0s in y_true to 1 makes the code work:

    import numpy as np
    from sklearn.metrics import roc_auc_score
    y_true = np.array([0, 1, 0, 0])
    y_scores = np.array([1, 0, 0, 0])
    roc_auc_score(y_true, y_scores)
    

    I believe the error message is telling you that y_true contains only one class (all zeros); you need both classes in y_true.
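
Since the requirement is simply that `y_true` contains both classes, the check can be made explicit before calling the metric. The helper below, `safe_roc_auc`, is a hypothetical name introduced for illustration (it is not part of scikit-learn); it returns NaN for single-class folds so they can be skipped when averaging.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_scores):
    """Hypothetical helper: return NaN when y_true has fewer than two classes."""
    if np.unique(y_true).size < 2:
        return float("nan")
    return roc_auc_score(y_true, y_scores)

y_scores = np.array([1, 0, 0, 0])
undefined = safe_roc_auc(np.array([0, 0, 0, 0]), y_scores)  # single class: NaN
defined = safe_roc_auc(np.array([0, 1, 0, 0]), y_scores)    # both classes: a real score

# nanmean ignores the undefined folds when averaging.
mean_auc = np.nanmean([undefined, defined])
print(undefined, defined, mean_auc)
```

Skipping undefined folds changes how many folds contribute to the average, so it is worth reporting how many were dropped.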

  • 2020-12-16 15:07

    You can increase the batch size, e.g. from 32 to 64, or you can use StratifiedKFold or StratifiedShuffleSplit. If the error still occurs, try shuffling your data, e.g. in your DataLoader.
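
As a rough illustration of the `StratifiedShuffleSplit` option, the sketch below assumes a synthetic label vector with 10 positives out of 100 and checks that every test split contains both classes:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels (assumption): 10 positives, 90 negatives.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))  # features do not affect the split; only their length is used

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
classes_per_split = [
    np.unique(y[test_idx]).size for _, test_idx in sss.split(X, y)
]
print(classes_per_split)
```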
