Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself?

13 answers
  • 2020-12-02 11:08

    I found the best solutions here.

    And this is the one I think is the simplest.

    import pandas as pd

    dataset = pd.read_csv("data.csv")
    X = dataset.iloc[:, 1:12].values
    y = dataset.iloc[:, 12].values

    # imbalanced-learn < 0.4 API: return_indices and fit_sample were removed in later versions
    from imblearn.under_sampling import RandomUnderSampler
    rus = RandomUnderSampler(return_indices=True)
    X_rus, y_rus, id_rus = rus.fit_sample(X, y)
    

    Then you can use the X_rus, y_rus data.

    For imbalanced-learn versions 0.4 and later:

    from imblearn.under_sampling import RandomUnderSampler

    # fit_resample replaced fit_sample from version 0.4 onward
    rus = RandomUnderSampler()
    X_rus, y_rus = rus.fit_resample(X, y)
    

    The indices of the randomly selected samples can then be retrieved via the sample_indices_ attribute.
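
    For example, a minimal sketch (reusing the rus object fitted above) to recover the positions of the retained rows in the original arrays:

    # positions of the kept rows in the original X and y
    id_rus = rus.sample_indices_
    X_check = X[id_rus]  # row-for-row identical to X_rus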

  • 2020-12-02 11:09

    Below is my Python implementation for creating a balanced copy of the data. Assumptions: (1) the target variable (y) is binary (0 vs. 1); (2) 1 is the minority class.

    from numpy import unique
    from numpy import random

    def balanced_sample_maker(X, y, random_seed=None):
        """Return a balanced data set by oversampling the minority class.

        The current version assumes that the positive class (1) is the minority.

        Parameters:
        ===========
        X: {numpy.ndarray}
        y: {numpy.ndarray}
        """
        uniq_levels = unique(y)
        uniq_counts = {level: sum(y == level) for level in uniq_levels}

        if random_seed is not None:
            random.seed(random_seed)

        # find the observation indices of each class level
        groupby_levels = {}
        for level in uniq_levels:
            obs_idx = [idx for idx, val in enumerate(y) if val == level]
            groupby_levels[level] = obs_idx

        # oversample the positive (minority) class up to the majority count
        sample_size = uniq_counts[0]
        over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
        balanced_copy_idx = groupby_levels[0] + over_sample_idx
        random.shuffle(balanced_copy_idx)

        return X[balanced_copy_idx, :], y[balanced_copy_idx]
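
    A quick usage sketch on synthetic data (the toy arrays below are illustrative assumptions, not part of the original answer):

    import numpy as np

    X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
    y = np.array([0] * 8 + [1] * 2)   # 8 majority (0) vs. 2 minority (1)

    X_bal, y_bal = balanced_sample_maker(X, y, random_seed=42)
    print(len(y_bal), int(y_bal.sum()))  # 16 samples, 8 of them positive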
    
  • 2020-12-02 11:10

    There now exists a full-blown Python package to address imbalanced data. It is available as a scikit-learn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

  • 2020-12-02 11:11

    Although this has already been answered, I stumbled upon your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:

    from sklearn.model_selection import StratifiedKFold

    X = samples_array
    y = classes_array  # subsamples will be stratified according to y
    n = desired_number_of_subsamples

    skf = StratifiedKFold(n_splits=n, shuffle=True)

    for _, batch in skf.split(X, y):
        do_something(X[batch], y[batch])

    The _ is important: because skf.split() is designed to create stratified folds for K-fold cross-validation, it returns two arrays of indices per split: the train indices ((n - 1) / n of the elements) and the test indices (1 / n of the elements). Here only the smaller test portion is kept as the subsample.

    Please note that this applies as of sklearn 0.18. In sklearn 0.17, the same class can be found in the cross_validation module instead.
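
    A self-contained sketch of the idea (the synthetic 90/10 data and n = 5 are assumptions for illustration):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for _, batch in skf.split(X, y):
        # each batch holds ~20 indices with the same 90/10 class ratio as y
        print(np.bincount(y[batch]))   # [18 2]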

  • 2020-12-02 11:18

    Here is my solution, which can be tightly integrated into an existing sklearn pipeline:

    from sklearn.model_selection import RepeatedKFold
    import numpy as np


    class DownsampledRepeatedKFold(RepeatedKFold):

        def __init__(self, n_splits=5, n_repeats=10, random_state=None):
            self.n_splits = n_splits
            super(DownsampledRepeatedKFold, self).__init__(
                n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
            )

        def split(self, X, y=None, groups=None):
            for i in range(self.n_repeats):
                # reseed so each repeat draws a different subsample
                np.random.seed()
                # get indices of the major class (negative)
                idxs_class0 = np.argwhere(y == 0).ravel()
                # get indices of the minor class (positive)
                idxs_class1 = np.argwhere(y == 1).ravel()
                # get the size of the minor class
                len_minor = len(idxs_class1)
                # subsample the major class down to the size of the minor class,
                # without replacement so no row is picked twice
                idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor, replace=False)
                original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
                np.random.shuffle(original_indx_downsampled)
                # self.cv is the KFold class stored by RepeatedKFold's base class
                splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

                for train_index, test_index in splits:
                    # map fold-local indices back to positions in the original X, y
                    yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]
    

    Use it as usual:

    for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    
  • 2020-12-02 11:19

    Simply select 100 rows from each class, with duplicates allowed, using the following code. activity is my class column (the labels of the dataset).

    balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
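
    If a class has fewer than 100 rows, or you want to balance against the rarest class instead of a fixed count, a small variant (a sketch; the column name is taken from the answer above, the rest is an assumption):

    # size of the rarest class
    n_min = Pdf_train['activity'].value_counts().min()
    balanced_df = (Pdf_train.groupby('activity', as_index=False, group_keys=False)
                            .apply(lambda s: s.sample(n_min, replace=True)))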
    