I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m
Below is my Python implementation for creating a balanced copy of the data by oversampling the minority class. Assumptions: 1. the target variable (y) is binary (0 vs. 1); 2. class 1 is the minority.
from numpy import unique
from numpy import random

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling minority class
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find observation index of each class level
    groupby_levels = {}
    for level in uniq_levels:
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversample the positive (minority) class, with replacement,
    # until it matches the size of the negative (majority) class
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()

    # combine all majority rows with the oversampled minority rows and shuffle
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
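
Since the question asks for N subsamples and mentions pandas, here is a rough sketch of a different approach from the function above: it undersamples every class down to the minority count instead of oversampling the minority. The function name `balanced_subsamples`, its parameters, and the toy `demo` frame are made up for illustration, and it assumes pandas 1.1 or newer (for `groupby(...).sample`).

```python
import numpy as np
import pandas as pd


def balanced_subsamples(df, target, n_subsamples, random_seed=None):
    """Return a list of balanced DataFrames (hypothetical helper).

    Each subsample keeps `min_count` rows per class, where `min_count`
    is the size of the smallest class, drawn without replacement.
    """
    rng = np.random.default_rng(random_seed)
    min_count = int(df[target].value_counts().min())

    subsamples = []
    for _ in range(n_subsamples):
        # derive a fresh integer seed per subsample so the draws differ
        seed = int(rng.integers(0, 2**32 - 1))
        # draw min_count rows from every class
        sub = df.groupby(target).sample(n=min_count, random_state=seed)
        # shuffle rows so the classes are not grouped together
        sub = sub.sample(frac=1, random_state=seed)
        subsamples.append(sub)
    return subsamples


# toy data: 90 rows of class 0 and 10 rows of class 1
demo = pd.DataFrame({"x": np.arange(100), "y": [0] * 90 + [1] * 10})
subs = balanced_subsamples(demo, "y", n_subsamples=3, random_seed=0)
```

Each element of `subs` should contain the same number of rows for both classes. Undersampling throws away majority rows in each draw, which is usually acceptable when you take several subsamples from a large dataset; if the minority class is very small, oversampling as in the function above may be the better fit.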