Stratified Sampling in Pandas

慢半拍i 2020-12-12 22:17

I've looked at the sklearn stratified sampling docs and the pandas docs, as well as the questions "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but

3 Answers
  •  夕颜 (original poster)
    2020-12-12 23:16

    The following samples a total of N rows, with each group appearing in its original proportion rounded to the nearest integer, then shuffles the result and resets the index. Because of the rounding, the total can differ slightly from N. Example data:

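    import numpy as np
    import pandas as pd

    df = pd.DataFrame(dict(
        A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
        B=range(20)
    ))
    N = 10  # total sample size; example value, not given in the original post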
    

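    Short and sweet (caveat: weights='A' uses the values in column A as row sampling weights, so this is weighted sampling and does not by itself preserve each group's share):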

    df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
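
To see what `weights='A'` does, here is a small check on the same example frame. It compares each group's share of the rows with the share of sampling probability the group receives once pandas normalizes column A to sum to 1. The names `group_share`, `weighted_share`, and `comparison` are made up for this illustration, and the probabilities describe a single draw only.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

# Share of rows in each group: what a proportional stratified sample should keep.
group_share = df['A'].value_counts(normalize=True).sort_index()

# Probability mass each group receives under weights='A':
# the column values are normalized so that they sum to 1.
row_prob = df['A'] / df['A'].sum()
weighted_share = row_prob.groupby(df['A']).sum()

comparison = pd.DataFrame({
    'group_share': group_share,
    'weighted_share': weighted_share,
})
print(comparison)
```

Group 4 holds 25% of the rows but receives 20/45 (about 44%) of the probability mass, while group 1 holds 35% of the rows and receives only 7/45 (about 16%), so rows with larger values of A are over-represented.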
    

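    Long version:

    (
        df.groupby('A', group_keys=False)
        # each group contributes its share of N, rounded to the nearest integer
        .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
        .sample(frac=1)  # shuffle the combined rows
        .reset_index(drop=True)
    )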
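
The same idea can be wrapped in a reusable function. This is only a sketch: `stratified_sample` and its parameters `column`, `n`, and `seed` are hypothetical names, not part of pandas. It uses an explicit loop over the groups instead of `apply`, purely for readability, and it assumes a pandas version whose `sample` accepts a NumPy `Generator` as `random_state` (1.4 or later, as far as I know).

```python
import numpy as np
import pandas as pd


def stratified_sample(df, column, n, seed=None):
    """Hypothetical helper: draw about n rows, keeping each group's share of `column`."""
    rng = np.random.default_rng(seed)
    parts = []
    for _, group in df.groupby(column):
        # This group's share of n, rounded to the nearest integer
        # (np.rint rounds exact halves to the nearest even number).
        k = int(np.rint(n * len(group) / len(df)))
        parts.append(group.sample(n=k, random_state=rng))
    # Shuffle the combined rows, then renumber the index from 0.
    return pd.concat(parts).sample(frac=1, random_state=rng).reset_index(drop=True)


df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

sample = stratified_sample(df, 'A', n=10, seed=1)
# Per-group row counts for n=10: group 1 -> 4, group 2 -> 3, group 3 -> 1, group 4 -> 2
print(sample['A'].value_counts().sort_index())
```

Which rows are picked depends on the seed, but the per-group counts are fixed by the rounding: 3.5 rounds to 4, 3.0 stays 3, 1.0 stays 1, and 2.5 rounds to 2.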
    
