Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m

13 answers
  •  旧时难觅i
    2020-12-02 11:09

    Below is my Python implementation for creating a balanced copy of the data by oversampling the minority class. Assumptions: (1) the target variable y is binary (0 vs. 1); (2) class 1 is the minority.

    import numpy as np

    def balanced_sample_maker(X, y, random_seed=None):
        """Return a balanced copy of (X, y) by oversampling the minority class.

        Assumes y is binary (0 vs. 1) and that class 1 is the minority.

        Parameters
        ----------
        X : numpy.ndarray, shape (n_samples, n_features)
        y : numpy.ndarray, shape (n_samples,)
        random_seed : int or None
            Seed for a local random generator; None leaves it unseeded.
        """
        # a local generator avoids reseeding NumPy's global random state
        rng = np.random.default_rng(random_seed)

        # find the observation indices of each class
        idx_majority = np.flatnonzero(y == 0)
        idx_minority = np.flatnonzero(y == 1)

        # oversample the positive (minority) class, with replacement,
        # until it matches the size of the majority class
        over_sample_idx = rng.choice(idx_minority, size=idx_majority.size, replace=True)

        # combine both classes and shuffle the row order
        balanced_copy_idx = np.concatenate([idx_majority, over_sample_idx])
        rng.shuffle(balanced_copy_idx)

        return X[balanced_copy_idx], y[balanced_copy_idx]
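
The question asks for N balanced subsamples, while the function above returns a single oversampled copy. As a rough sketch of the other direction, the code below undersamples instead: each subsample draws, without replacement, as many observations from every class as the smallest class contains. The helper name `balanced_subsamples`, its parameters, and the toy data are hypothetical and made up for illustration; this is not a scikit-learn or pandas function, only plain NumPy.

```python
import numpy as np


def balanced_subsamples(y, n_subsamples, random_seed=None):
    """Return a list of index arrays, one per balanced subsample.

    Hypothetical helper (an assumption for illustration, not a library API).
    Every class contributes as many observations as the smallest class has,
    drawn without replacement within each subsample.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(random_seed)
    levels, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()

    subsamples = []
    for _ in range(n_subsamples):
        # draw n_per_class row indices from each class
        parts = [
            rng.choice(np.flatnonzero(y == level), size=n_per_class, replace=False)
            for level in levels
        ]
        idx = np.concatenate(parts)
        rng.shuffle(idx)  # mix the classes within the subsample
        subsamples.append(idx)
    return subsamples


# Toy data: 90 observations of class 0 and 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

subsamples = balanced_subsamples(y, n_subsamples=3, random_seed=0)
for idx in subsamples:
    X_sub, y_sub = X[idx], y[idx]
    print(X_sub.shape, np.bincount(y_sub))
```

With the toy data, each of the three subsamples holds 10 observations per class. The subsamples are drawn independently, so the same majority-class row can appear in more than one of them; whether that is acceptable depends on what the subsamples are used for.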
    
