Stratified sampling on factor

Question

I have a dataset of 1000 rows with the following structure:

     device geslacht leeftijd type1 type2
1       mob        0       53     C     3
2       tab        1       64     G     7
3        pc        1       50     G     7
4       tab        0       75     C     3
5       mob        1       54     G     7
6        pc        1       58     H     8
7        pc        1       57     A     1
8        pc        0       68     E     5
9        pc        0       66     G     7
10      mob        0       45     C     3
11      tab        1       77     E     5
12      mob        1       16     A     1

I would like to draw a sample of 80 rows, composed of 10 rows with type1 = A, 10 rows with type1 = B, and so on. Can anyone help me?

Answer 1:

Base R solution:

# Split by type1, draw 10 rows from each piece, then stack the pieces
do.call(rbind,
        lapply(split(df, df$type1), function(i)
          i[sample(seq_len(nrow(i)), size = 10, replace = TRUE), ]))

Answer 2:

Here's how I would approach this using data.table

library(data.table)
indx <- setDT(df)[, .I[sample(.N, 10, replace = TRUE)], by = type1]$V1
df[indx]
#     device geslacht leeftijd type1 type2
#  1:    mob        0       45     C     3
#  2:    mob        0       53     C     3
#  3:    tab        0       75     C     3
#  4:    mob        0       53     C     3
#  5:    tab        0       75     C     3
#  6:    mob        0       45     C     3
#  7:    tab        0       75     C     3
#  8:    mob        0       53     C     3
#  9:    mob        0       53     C     3
# 10:    mob        0       53     C     3
# 11:    mob        1       54     G     7
#...

Or a simpler version would be

setDT(df)[, .SD[sample(.N, 10, replace = TRUE)], by = type1]

In short, we sample row indexes within each group of type1 (with replacement, since your example has fewer than 10 rows per group) and then subset the data by those indexes.

Similarly with dplyr you could do

library(dplyr)
df %>% 
  group_by(type1) %>%
  sample_n(10, replace = TRUE)
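
The snippets above always sample with replacement because the example groups have fewer than 10 rows. If most groups in the full data are large enough, you may prefer to fall back to replacement only where it is needed. Here is a minimal base R sketch of that idea; the toy data frame and the helper name sample_group are made up for illustration and are not part of the original question.

```r
# Toy data (hypothetical): 8 strata of unequal size, the last one has only 5 rows
df <- data.frame(
  id    = 1:200,
  type1 = rep(LETTERS[1:8], times = c(40, 30, 30, 30, 30, 25, 10, 5)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw n row indexes from one group, using
# replacement only when the group has fewer than n rows
sample_group <- function(idx, n) {
  idx[sample.int(length(idx), n, replace = length(idx) < n)]
}

set.seed(42)  # make the draw reproducible
rows <- unlist(
  lapply(split(seq_len(nrow(df)), df$type1), sample_group, n = 10),
  use.names = FALSE
)
strat <- df[rows, ]

table(strat$type1)  # row count per stratum
```

Indexing into idx with sample.int(), rather than calling sample(idx, ...), avoids the case where a one-row group would be interpreted as the range 1:idx.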

Answer 3:

Another option in base R:

df[as.vector(sapply(unique(df$type1),
                    function(x) {
                        rows <- which(df$type1 == x)
                        # index into rows so a one-row group is not read as 1:n
                        rows[sample.int(length(rows), 10, replace = TRUE)]
                    })), ]
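
Whichever variant you use, it is worth checking that every stratum really received 10 rows. A small sketch of such a check, on a made-up three-group data frame (the column value and the object names are assumptions for illustration):

```r
# Toy data (hypothetical): three strata with 20, 15 and 4 rows
df <- data.frame(
  type1 = rep(c("A", "B", "C"), times = c(20, 15, 4)),
  value = seq_len(39),
  stringsAsFactors = FALSE
)

set.seed(1)  # fix the RNG state so the draw is reproducible
idx <- as.vector(sapply(unique(df$type1), function(x) {
  rows <- which(df$type1 == x)
  rows[sample.int(length(rows), 10, replace = TRUE)]
}))
smp <- df[idx, ]

# One count per stratum; each should equal 10
counts <- table(smp$type1)
counts
```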

Source: https://stackoverflow.com/questions/30097382/stratified-sampling-on-factor

Tags

dataframe

sampling