Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m

13 answers

再見小時候 (original poster)

2020-12-02 11:08

I found the best solutions here

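And this is the one I think is the simplest.

import pandas as pd

# Columns 1-11 are the features, column 12 is the class label
dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

# imbalanced-learn < 0.4: return_indices=True also returns the selected row indices
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)

You can then use X_rus and y_rus as the balanced data.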

For versions >= 0.4:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X, y)

The indices of the randomly selected samples are then available through the sample_indices_ attribute of rus.
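The snippets above produce a single balanced subsample, while the question asks for N of them. Here is a minimal sketch of one way to do that with NumPy alone, assuming "balanced" means every class is cut down to the size of the rarest class. The function name balanced_subsamples, its parameters, and the toy X_demo / y_demo arrays are hypothetical and not part of imbalanced-learn.

```python
import numpy as np

def balanced_subsamples(y, n_subsamples, seed=0):
    """Return a list of index arrays, each selecting a balanced random subsample of y."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # every class is reduced to the rarest class size
    subsamples = []
    for _ in range(n_subsamples):
        # Draw n_per_class rows without replacement from each class
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
            for c in classes
        ])
        rng.shuffle(idx)  # avoid having the rows grouped by class
        subsamples.append(idx)
    return subsamples

# Toy data (hypothetical): 90 rows of class 0 and 10 rows of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(200).reshape(100, 2)

index_sets = balanced_subsamples(y_demo, n_subsamples=3)
X_sub, y_sub = X_demo[index_sets[0]], y_demo[index_sets[0]]
```

Each entry of index_sets can be used to index both X and y. Returning indices rather than copies keeps memory use low on a large dataset, and the subsamples are drawn independently, so the same row may appear in more than one of them.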
