I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m
Although it is already answered, I stumbled upon your question looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:
from sklearn.model_selection import StratifiedKFold

X = samples_array  # NumPy array of samples
y = classes_array  # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n_splits=n, shuffle=True)
for _, batch in skf.split(X, y):
    # batch holds the test-fold indices, i.e. one stratified subsample
    do_something(X[batch], y[batch])
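The _ is needed because skf.split() is designed to create stratified folds for K-fold cross-validation, so each iteration yields two arrays of indices: train ((n - 1) / n of the elements) and test (1 / n of the elements). Only the test indices are used here; the n test folds are disjoint and together cover the whole dataset.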
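To check what the test folds actually look like, here is a minimal sketch on synthetic data. The array sizes, the 90/30 class split, and the name subsamples are assumptions made up for illustration; they are not from the question. Note that stratification keeps the class ratio of y in every subsample, so each fold here is as unbalanced as the full dataset rather than 50/50.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic unbalanced data (illustrative assumption):
# 90 samples of class 0 and 30 samples of class 1.
X = np.arange(120).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 30)

n = 3  # number of subsamples
skf = StratifiedKFold(n_splits=n, shuffle=True, random_state=0)

subsamples = []
for _, test_idx in skf.split(X, y):
    # Keep only the test-fold indices; each fold is one subsample.
    subsamples.append(test_idx)
    # Per-class counts in this subsample.
    print(len(test_idx), np.bincount(y[test_idx]))

# The folds are disjoint and together cover every index exactly once.
all_idx = np.sort(np.concatenate(subsamples))
print(np.array_equal(all_idx, np.arange(len(y))))
```

If you need the classes to be equally represented in each subsample, you would still have to downsample the majority class within each fold (or before splitting).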
Please note that this applies to sklearn 0.18 and later. In sklearn 0.17 the equivalent class lives in the cross_validation module instead, where the labels are passed to the constructor rather than to split().