I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself?
I found the best solutions here, and this is the one I think is the simplest.
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

# imbalanced-learn < 0.4: return_indices also returns the selected row indices
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)
Then you can use the X_rus, y_rus data.
For imbalanced-learn versions >= 0.4:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_sample(X, y)
Then the indices of the randomly selected samples can be retrieved via the sample_indices_ attribute.
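For reference, a minimal sketch using the newer method name fit_resample (available from imbalanced-learn 0.4 onwards) and reading the selected indices back:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
kept_idx = rus.sample_indices_   # indices of the rows kept from the original X, y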
Below is my Python implementation for creating a balanced copy of the data. Assumptions: (1) the target variable (y) is a binary class (0 vs. 1); (2) 1 is the minority class.
from numpy import unique
from numpy import random

def balanced_sample_maker(X, y, random_seed=None):
    """Return a balanced data set by oversampling the minority class.

    The current version is developed on the assumption that the positive
    class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversample the observations with the positive label
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
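A quick usage sketch of the function above (the toy arrays are made up purely for illustration):

import numpy as np

X_demo = np.random.rand(100, 5)
y_demo = np.array([0] * 90 + [1] * 10)   # 1 is the minority class
X_bal, y_bal = balanced_sample_maker(X_demo, y_demo, random_seed=42)
print(np.bincount(y_bal))                # both classes now appear 90 times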
There now exists a full-blown Python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn
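If the goal from the original question is N balanced random subsamples, one simple approach (a sketch, with placeholder names) is to run the undersampler N times with different seeds:

from imblearn.under_sampling import RandomUnderSampler

n_subsamples = 10                       # however many subsamples you need
subsamples = []
for seed in range(n_subsamples):
    rus = RandomUnderSampler(random_state=seed)
    X_sub, y_sub = rus.fit_resample(X, y)
    subsamples.append((X_sub, y_sub))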
Although this has already been answered, I stumbled upon your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:
from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array   # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n, shuffle=True)
for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])
It's important that you add the _: because skf.split() is designed to create stratified folds for K-fold cross-validation, it returns two lists of indices per split: train (containing (n - 1)/n of the elements) and test (containing 1/n of the elements).
Please note that this is as of sklearn 0.18. In sklearn 0.17 the same function can be found in the cross_validation module instead.
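A quick sanity check, assuming y holds integer class labels: each test-fold subsample keeps the same class proportions as the full y (stratification preserves the ratio rather than equalising it), which you can verify by counting the labels per fold:

import numpy as np

for _, batch in skf.split(X, y):
    print(np.bincount(y[batch]))   # per-class counts in this subsample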
Here is my solution, which can be tightly integrated into an existing sklearn pipeline:
from sklearn.model_selection import RepeatedKFold
import numpy as np

class DownsampledRepeatedKFold(RepeatedKFold):

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        super(DownsampledRepeatedKFold, self).__init__(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get the indices of the major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get the indices of the minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get the size of the minor class
            len_minor = len(idxs_class1)
            # draw a subsample of the major class of minor-class size (no duplicates)
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor, replace=False)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))
            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]
Use it as usual:
for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Simply select 100 rows from each class, with duplicates, using the following code; activity is my class column (the labels of the dataset):
balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
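To get N such balanced frames rather than a single one, the same call can simply be repeated with different seeds; a sketch, assuming the Pdf_train frame and activity column from above:

balanced_dfs = [
    Pdf_train.groupby('activity', as_index=False, group_keys=False)
             .apply(lambda s: s.sample(100, replace=True, random_state=seed))
    for seed in range(10)   # 10 balanced subsamples
]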