I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m
This is a slight modification to the top answer by mikkom, for the case where you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle it.
Instead of
if len(this_xs) > use_elems:
    np.random.shuffle(this_xs)
do this
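if len(this_xs) > use_elems:
    # Use integer division: in Python 3, / returns a float,
    # and a float slice step raises TypeError.
    ratio = len(this_xs) // use_elems
    # Every ratio-th element, original order kept; this can return a few
    # more than use_elems elements, so trim afterwards if needed.
    this_xs = this_xs[::ratio]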
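For anyone who wants to try the strided version on its own, here is a minimal, self-contained sketch. The function name `ordered_subsample` and the sample data are made up for illustration and are not part of mikkom's answer; the final `[:use_elems]` trim is my assumption about how you would want to cap the size when the stride does not divide the length evenly.

```python
import numpy as np

def ordered_subsample(xs, use_elems):
    """Return at most use_elems items of xs, evenly strided, order preserved.

    Hypothetical helper for illustration only.
    """
    if len(xs) > use_elems:
        # Integer stride (always >= 1 here because len(xs) > use_elems).
        ratio = len(xs) // use_elems
        xs = xs[::ratio]
    # The stride can leave a few extra items, so cap the result.
    return xs[:use_elems]

data = np.arange(10)                  # stand-in for the larger class
sample = ordered_subsample(data, 4)   # stride 2, then trimmed to 4 items
```

Note that the trim drops items from the tail, so the end of the larger class can be slightly under-represented when the length is not a multiple of `use_elems`. If that matters for your data, picking evenly spaced indices instead of a fixed stride may be a better fit.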