Stratified Sampling in Pandas

慢半拍i 2020-12-12 22:17

I've looked at the Sklearn stratified sampling docs as well as the pandas docs and also Stratified samples from Pandas and sklearn stratified sampling based on a column but

3 Answers
  •  隐瞒了意图╮
    2020-12-12 23:13

    Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can simply take n_samples rows from each class (as in the previous answer). When the smallest class has fewer than n_samples rows, we instead take as many rows from every class as the smallest class contains.

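    def stratified_sample_df(df, col, n_samples):
        # Cap the per-class sample size at the size of the smallest class.
        n = min(n_samples, df[col].value_counts().min())
        # Draw n rows at random from each class.
        df_ = df.groupby(col).apply(lambda x: x.sample(n))
        # Drop the group-key index level added by apply(), keeping the original row index.
        df_.index = df_.index.droplevel(0)
        return df_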
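
As a quick check of the balancing rule, here is a minimal, self-contained sketch on a toy frame. It is a variant rather than the function above: it assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available, and the names `balanced_sample`, `seed`, and `toy` are made up for illustration.

```python
import pandas as pd


def balanced_sample(df, col, n_samples, seed=0):
    """Hypothetical helper: draw the same number of rows from every class in `col`."""
    # Cap the per-class size at the size of the smallest class.
    n = int(min(n_samples, df[col].value_counts().min()))
    # Sample n rows per group; the original row index is preserved.
    return df.groupby(col).sample(n=n, random_state=seed)


# Toy data (assumed): class "b" has only 3 rows, fewer than the 5 requested.
toy = pd.DataFrame({
    "label": ["a"] * 6 + ["b"] * 3 + ["c"] * 10,
    "value": range(19),
})

sampled = balanced_sample(toy, "label", n_samples=5)
counts = sampled["label"].value_counts()
print(counts)  # every class ends up with 3 rows
```

Because the minority class has 3 rows, every class is cut down to 3 even though 5 were requested. Pass a different `seed` (or `None`) if you do not need reproducible draws.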
    
