Stratified sampling on factor

Submitted by 点点圈 on 2019-12-10 13:27:55

Question


I have a dataset of 1000 rows with the following structure:

     device geslacht leeftijd type1 type2
1       mob        0       53     C     3
2       tab        1       64     G     7
3        pc        1       50     G     7
4       tab        0       75     C     3
5       mob        1       54     G     7
6        pc        1       58     H     8
7        pc        1       57     A     1
8        pc        0       68     E     5
9        pc        0       66     G     7
10      mob        0       45     C     3
11      tab        1       77     E     5
12      mob        1       16     A     1

I would like to draw a sample of 80 rows, composed of 10 rows with type1 = A, 10 rows with type1 = B, and so on for each of the 8 levels of type1. Can anyone help me?


Answer 1:


Base R solution:

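# Split df by type1, draw 10 rows (with replacement) from each piece,
# then stack the pieces back into one data frame.
do.call(rbind,
        lapply(split(df, df$type1), function(i)
          i[sample(seq_len(nrow(i)), size = 10, replace = TRUE), ]))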
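
To try the split/lapply/rbind approach end to end, here is a minimal sketch on synthetic data. The data frame below is a hypothetical stand-in for the 1000-row dataset in the question: only the column names come from the question, the values are made up, and the seed is set purely for reproducibility.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Hypothetical stand-in for the 1000-row dataset described in the question.
df <- data.frame(
  device   = sample(c("mob", "tab", "pc"), 1000, replace = TRUE),
  geslacht = sample(0:1, 1000, replace = TRUE),
  leeftijd = sample(16:80, 1000, replace = TRUE),
  type1    = rep(LETTERS[1:8], length.out = 1000)  # 8 levels, A to H
)
# In the rows shown in the question, type2 equals the letter's position (C = 3, G = 7).
df$type2 <- match(df$type1, LETTERS)

# One data frame per level of type1.
strata <- split(df, df$type1)

# Draw 10 rows from each stratum and stack the results.
sampled <- do.call(rbind, lapply(strata, function(g) {
  g[sample(seq_len(nrow(g)), size = 10, replace = TRUE), ]
}))

# Count of sampled rows per level of type1.
table(sampled$type1)
```

The final `table()` call is a quick way to confirm that every level of type1 contributes the same number of rows.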

EDIT:

Other solutions suggested by @BrodieG

with(df, df[unlist(lapply(split(seq_along(type1), type1), sample, 10, TRUE)), ])

with(df, df[c(sapply(split(seq_along(type1), type1), sample, 10, TRUE)), ])



Answer 2:


Here's how I would approach this using data.table

library(data.table)
indx <- setDT(df)[, .I[sample(.N, 10, replace = TRUE)], by = type1]$V1
df[indx]
#     device geslacht leeftijd type1 type2
#  1:    mob        0       45     C     3
#  2:    mob        0       53     C     3
#  3:    tab        0       75     C     3
#  4:    mob        0       53     C     3
#  5:    tab        0       75     C     3
#  6:    mob        0       45     C     3
#  7:    tab        0       75     C     3
#  8:    mob        0       53     C     3
#  9:    mob        0       53     C     3
# 10:    mob        0       53     C     3
# 11:    mob        1       54     G     7
#...

Or a simpler version would be

setDT(df)[, .SD[sample(.N, 10, replace = TRUE)], by = type1]

In short, we sample row indices within each group of type1 (with replacement, because the example data has fewer than 10 rows in each group) and then subset the data by those indices.
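
If the full dataset has at least 10 rows in most groups, you may prefer to sample without replacement and fall back to replacement only where a group is too small. The following is a rough sketch of that idea, written in base R so it runs without extra packages. The helper `stratified_sample()` and the toy data are hypothetical and not part of any library, and the sketch assumes every group has at least one row.

```r
# Hypothetical helper: draw n rows per level of a grouping column,
# sampling with replacement only for groups that have fewer than n rows.
# Assumes every group contains at least one row.
stratified_sample <- function(data, group_col, n) {
  # Row indices of `data`, split by group.
  idx_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  picked <- lapply(idx_by_group, function(idx) {
    # Sample positions within idx, then map back to row indices.
    idx[sample.int(length(idx), n, replace = length(idx) < n)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data: group A is large enough, group B is not.
df_small <- data.frame(
  type1 = c(rep("A", 25), rep("B", 3)),
  value = seq_len(28)
)

out <- stratified_sample(df_small, "type1", 10)
table(out$type1)
```

Here group A is sampled without replacement, so its 10 rows are distinct, while group B has only 3 rows and is necessarily sampled with repeats.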


Similarly with dplyr you could do

library(dplyr)
# In dplyr >= 1.0.0, sample_n() is superseded by slice_sample(n = 10, replace = TRUE)
df %>%
  group_by(type1) %>%
  sample_n(10, replace = TRUE)



Answer 3:


Another option in base R:

# For each level of type1, sample 10 of its row indices, then subset df.
df[as.vector(sapply(unique(df$type1),
                    function(x) {
                      sample(which(df$type1 == x), 10, replace = TRUE)
                    })), ]
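
One caveat when passing a vector of row indices directly to `sample()`, as in `sample(which(df$type1 == x), ...)`: if a group contains exactly one row, `sample()` receives a single number `k` and draws from `1:k` instead of returning that one index. The sketch below illustrates this on hypothetical toy data and uses the `resample()` idiom shown in the examples of `?sample` as a workaround.

```r
# Hypothetical toy data: group "B" has a single row, at index 5.
df_demo <- data.frame(type1 = c("A", "A", "A", "A", "B"), value = 1:5)

idx_b <- which(df_demo$type1 == "B")  # a length-one vector holding 5

# Risky: with a single number 5, sample() draws from 1:5,
# so rows belonging to group "A" can be returned.
risky <- sample(idx_b, 10, replace = TRUE)

# Safer idiom from the ?sample examples: sample positions, then index into x.
resample <- function(x, ...) x[sample.int(length(x), ...)]
safe <- resample(idx_b, 10, replace = TRUE)

unique(safe)
```

Approaches that sample within each piece using `seq_len(nrow(g))` or data.table's `.N` do not hit this case, because they already sample positions rather than index values.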


Source: https://stackoverflow.com/questions/30097382/stratified-sampling-on-factor
