Take random samples by group

说谎 2020-11-28 13:23

I have a data frame (df) of almost 50,000 rows spread across 15 different IDs (each ID has thousands of observations). df looks like:

        ID  Year    Temp    ph
1           


        
8 Answers
  •  遥遥无期
    2020-11-28 13:57

    If you have a big dataset, a data.table solution could look like this:

    library(data.table)
    
    # Generate 26 million rows of random data
    set.seed(2019)
    dt <- data.table(c1 = sample(length(LETTERS)*10^6), 
                     c2 = sample(LETTERS, replace = TRUE))
    
    # For each letter, sample 500 rows
    dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]
    
    # We indeed sampled 500 rows for each letter
    dt_sample[, .N, by = c2][order(c2)]
    #>     c2   N
    #>  1:  A 500
    #>  2:  D 500
    #>  3:  G 500
    #>  4:  I 500
    #>  5:  M 500
    #>  6:  N 500
    #>  7:  O 500
    #>  8:  P 500
    #>  9:  Q 500
    #> 10:  R 500
    #> 11:  S 500
    #> 12:  T 500
    #> 13:  U 500
    #> 14:  V 500
    #> 15:  W 500
    #> 16:  Y 500
    #> 17:  Z 500
    

    Created on 2019-04-23 by the reprex package (v0.2.1)
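
    The same pattern can be applied to a data frame shaped like the one in the question. The sketch below is only illustrative: the column names ID, Year, Temp and ph come from the question, but the values are made up, and the names dt_df and df_sample as well as the sample size of 500 are assumptions.

    ```r
    library(data.table)

    # Made-up stand-in for the question's df: 15 IDs with 3000 rows each
    set.seed(1)
    df <- data.frame(ID   = rep(1:15, each = 3000),
                     Year = sample(1990:2020, 45000, replace = TRUE),
                     Temp = rnorm(45000, mean = 15, sd = 5),
                     ph   = runif(45000, min = 6, max = 9))

    # Work on a data.table copy; setDT(df) would convert in place instead
    dt_df <- as.data.table(df)

    # Draw 500 rows per ID without replacement (500 is an arbitrary size)
    df_sample <- dt_df[, .SD[sample(.N, 500)], by = ID]

    # Check the number of sampled rows per ID
    df_sample[, .N, by = ID]
    ```

    Here every ID has more than 500 rows, so sample() never runs out of rows to draw from.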

    If your data is unbalanced, in the sense that some groups have fewer rows than your desired sample size, you need to guard against that by capping the sample size at min(500, .N); see "sample random rows within each group in a data.table". For example:

    dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
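
    To see what the cap does, here is a minimal sketch on a small, deliberately unbalanced table. The data, the names dt_small, n_wanted, capped and capped2, and the sample size of 20 are all hypothetical. The second variant samples row indices with .I instead of subsetting .SD; it is a commonly used alternative that may be faster on large tables, though that is not measured here.

    ```r
    library(data.table)

    # Hypothetical unbalanced data: group "a" has 1000 rows, "b" has 5, "c" has 50
    set.seed(42)
    dt_small <- data.table(id    = rep(c("a", "b", "c"), times = c(1000, 5, 50)),
                           value = rnorm(1055))

    n_wanted <- 20  # desired rows per group (illustrative)

    # Cap the sample size at the group size, so sample() is never asked
    # for more rows than the group contains
    capped <- dt_small[, .SD[sample(.N, min(n_wanted, .N))], by = id]

    # Rows obtained per group: small groups are returned whole
    counts <- capped[, .N, by = id]
    counts

    # Alternative: sample global row numbers (.I) per group, then subset once
    idx <- dt_small[, .I[sample(.N, min(n_wanted, .N))], by = id]$V1
    capped2 <- dt_small[idx]
    ```

    Without the min() guard, sample() would raise an error for group "b", because it cannot draw 20 rows without replacement from only 5.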
