Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m

13 answers

暗喜 (original poster)

2020-12-02 11:11
Although this has already been answered, I came across your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:
```
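import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for samples_array / classes_array; replace with your own data.
X = np.random.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)  # subsamples will be stratified according to y
n = 5  # desired number of subsamples

skf = StratifiedKFold(n_splits=n, shuffle=True)

def do_something(X_batch, y_batch):
    # Placeholder for whatever is done with each subsample.
    print(X_batch.shape, np.bincount(y_batch))

# split() yields (train_indices, test_indices); only the test part is used.
for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])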
```
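The _ is important: skf.split() is designed to create stratified folds for K-fold cross-validation, so on every iteration it yields two arrays of indices, train ((n - 1) / n of the elements) and test (1 / n of the elements). Only the test indices are used here, and each such subsample keeps approximately the same class proportions as y.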
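To make the train/test split concrete, here is a small sketch on made-up data (the array sizes, the 80/20 class mix, and the fixed random_state are assumptions chosen for illustration). It records the size of both index arrays and the class counts of each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)
n = 5

skf = StratifiedKFold(n_splits=n, shuffle=True, random_state=0)

fold_sizes = []         # (number of train indices, number of test indices)
fold_class_counts = []  # class counts inside each test fold
for train_idx, test_idx in skf.split(X, y):
    fold_sizes.append((len(train_idx), len(test_idx)))
    fold_class_counts.append(np.bincount(y[test_idx]).tolist())

print(fold_sizes)
print(fold_class_counts)
```

With these numbers every test fold holds 20 samples in the same 4:1 class ratio as y, so the subsamples are stratified rather than equalized.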

Please note that this applies to sklearn 0.18 and later. In sklearn 0.17 the equivalent class can be found in the cross_validation module instead.
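If the subsamples need equal class counts rather than the original class ratio, one possible approach is plain random undersampling with NumPy. The sketch below is not part of scikit-learn and not the method described above; balanced_subsamples is a hypothetical helper name, and it assumes every class should be cut down to the size of the rarest class.

```python
import numpy as np

def balanced_subsamples(y, n_subsamples, seed=None):
    """Return a list of index arrays, each with equal counts per class.

    Hypothetical helper: every class is undersampled (without
    replacement) to the size of the rarest class.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    per_class = counts.min()
    subsamples = []
    for _ in range(n_subsamples):
        # Draw per_class distinct indices from each class.
        parts = [
            rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
            for c in classes
        ]
        idx = np.concatenate(parts)
        rng.shuffle(idx)  # mix the classes within the subsample
        subsamples.append(idx)
    return subsamples

# Made-up labels: 80 of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
subs = balanced_subsamples(y, n_subsamples=3, seed=0)
print([np.bincount(y[idx]).tolist() for idx in subs])
```

Each index array can be used as X[idx], y[idx]. The draws are independent, so different subsamples may share samples, and the minority class is reused in full every time.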