Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m

13 answers
  •  眼角桃花
    2020-12-02 10:57

    A short, Pythonic way to balance a pandas DataFrame by undersampling (uspl=True) or oversampling (uspl=False), using a specified label column that has two or more distinct values.

    With uspl=True, the code draws a random sample without replacement from every stratum, each the size of the smallest stratum. With uspl=False, it draws a random sample with replacement from every stratum, each the size of the largest stratum.

    import pandas as pd

    def balanced_spl_by(df, lblcol, uspl=True):
        datas_l = [df[df[lblcol] == l] for l in df[lblcol].unique()]  # one sub-frame per label
        lsz = [f.shape[0] for f in datas_l]
        n = min(lsz) if uspl else max(lsz)  # target number of rows per stratum
        # Sample every stratum to n rows, concatenate, then shuffle the rows.
        return pd.concat([f.sample(n=n, replace=not uspl) for f in datas_l], axis=0).sample(frac=1)
    

    This works only with a pandas DataFrame, but that seems to be a common use case, and restricting it to DataFrames shortens the code considerably, as far as I can tell.
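
    Since the question asks for N balanced subsamples rather than one, here is a minimal sketch of how the undersampling case could be repeated N times. It is not part of the answer above: the function name balanced_subsamples, its parameters, and the toy data are hypothetical, and it assumes pandas 1.1 or later, where groupby(...).sample is available.

```python
import pandas as pd

def balanced_subsamples(df, label_col, n_subsamples, seed=0):
    """Return a list of n_subsamples balanced, shuffled undersamples of df."""
    # Size of the smallest class; every class is cut down to this many rows.
    n_min = int(df[label_col].value_counts().min())
    subsamples = []
    for i in range(n_subsamples):
        # Draw n_min rows per class without replacement, using a distinct seed per draw.
        sub = df.groupby(label_col).sample(n=n_min, random_state=seed + i)
        # Shuffle so the rows are not grouped by class.
        subsamples.append(sub.sample(frac=1, random_state=seed + i))
    return subsamples

# Toy data (hypothetical): 90 rows of class "a" and 10 rows of class "b".
df = pd.DataFrame({"x": range(100), "y": ["a"] * 90 + ["b"] * 10})
subs = balanced_subsamples(df, "y", n_subsamples=3)
```

    Each element of subs keeps every row of the smallest class and a different random draw from the larger ones, so the subsamples overlap completely on the minority class.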
