I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it m
I found the best solutions here, and this is the one I think is the simplest.
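import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values  # feature columns
y = dataset.iloc[:, 12].values  # class label column
# return_indices=True also returns the indices of the selected rows
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)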
Then you can use X_rus and y_rus as the balanced data.
For versions 0.4 and later:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
# fit_resample replaces the older fit_sample name, which newer releases no longer provide
X_rus, y_rus = rus.fit_resample(X, y)
The indices of the randomly selected samples are then available through the sample_indices_ attribute.
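Since the question asks for N subsamples rather than one, here is a minimal NumPy-only sketch of the same idea (random undersampling of every class down to the size of the smallest one), repeated N times. It is not imbalanced-learn's implementation; the helper name balanced_subsample_indices and the toy labels in y_demo are made up for illustration.

```python
import numpy as np

def balanced_subsample_indices(y, n_subsamples, seed=0):
    """Return a list of index arrays, each picking a class-balanced random subsample of y."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # every class is cut down to the minority class size
    subsamples = []
    for _ in range(n_subsamples):
        # Draw n_per_class row indices from each class, without replacement
        picked = [
            rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
            for c in classes
        ]
        subsamples.append(np.sort(np.concatenate(picked)))
    return subsamples

# Hypothetical toy labels: 90 samples of class 0, 10 samples of class 1
y_demo = np.array([0] * 90 + [1] * 10)
index_sets = balanced_subsample_indices(y_demo, n_subsamples=3)
```

Each entry of index_sets can then be used as X[idx], y[idx] to get one balanced subsample. Calling RandomUnderSampler in a loop with a different random_state each time should give the same kind of result.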