Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m

13 answers
  •  暗喜
    暗喜 (original poster)
    2020-12-02 11:11

    Although it is already answered, I stumbled upon your question looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:

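    from sklearn.model_selection import StratifiedKFold

    X = samples_array  # NumPy array of samples (placeholder name)
    y = classes_array  # class labels; subsamples will be stratified according to y
    n = desired_number_of_subsamples

    skf = StratifiedKFold(n_splits=n, shuffle=True)

    # split() yields (train_indices, test_indices); only the test indices are kept.
    for _, batch in skf.split(X, y):
        do_something(X[batch], y[batch])  # placeholder for your own processing
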

    The _ in the loop matters: skf.split() is designed to create stratified folds for K-fold cross-validation, so each iteration yields two arrays of indices: train ((n - 1) / n of the elements) and test (1 / n of the elements). Only the test indices are used as a subsample; the train indices are discarded.
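
    To make the two index arrays concrete, here is a minimal sketch on toy data. The data, the class sizes (90 vs. 30), n = 3 and the fixed random_state are assumptions chosen for illustration; they are not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumed for illustration): 90 samples of class 0, 30 of class 1.
X = np.arange(120).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 30)
n = 3  # number of subsamples

skf = StratifiedKFold(n_splits=n, shuffle=True, random_state=0)

fold_summaries = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many samples of each class landed in the test fold.
    class_counts = np.bincount(y[test_idx], minlength=2)
    fold_summaries.append((len(train_idx), len(test_idx), class_counts.tolist()))

for summary in fold_summaries:
    print(summary)  # (train size, test size, [class 0 count, class 1 count])
```

    With these sizes every fold reports 80 train indices, 40 test indices and class counts [30, 10]. Each subsample therefore keeps the 3:1 class ratio of y. In other words, the folds are stratified (same proportions as the full dataset), which is not the same thing as containing equal numbers of each class.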

    Please note that this is as of sklearn 0.18. In sklearn 0.17 a class of the same name can be found in the cross_validation module instead, although its constructor takes different arguments.
