R - sample and resample a person-period file

Question

I am working with a gigantic person-period file, and I thought that a good way to deal with such a large dataset would be to use sampling and resampling techniques.

My person-period file looks like this:

   id code time
1   1    a    1
2   1    a    2
3   1    a    3
4   2    b    1
5   2    c    2
6   2    b    3
7   3    c    1
8   3    c    2
9   3    c    3
10  4    c    1
11  4    a    2
12  4    c    3
13  5    a    1
14  5    c    2
15  5    a    3

I actually have two distinct issues.

The first issue is that I am having trouble simply sampling a person-period file.

For example, I would like to sample 2 id-sequences, such as:

  id code time
   1    a    1
   1    a    2
   1    a    3
   2    b    1
   2    c    2
   2    b    3

The following line of code works for sampling a person-period file (sampling from unique(dt$id) rather than dt$id guarantees two distinct ids):

dt[which(dt$id %in% sample(unique(dt$id), 2)), ]

However, I would like to use a dplyr solution because I am interested in resampling and in particular I would like to use replicate.

I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)

I am struggling with the dplyr solution because I am not sure what should be the grouping variable.

library(dplyr)
dt %>% group_by(id) %>% sample_n(1)

gives me an incorrect result because it does not keep the full sequence of each id.

Any clue how I could both sample and resample a person-period file?

Data

dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L, 
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b", 
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", 
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA, 
-15L), class = "data.frame")

Answer 1:

I think the idiomatic way would probably look like this:

set.seed(1)
# sample 2 distinct ids, then join back to recover each id's full sequence
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, dt, by = "id")

  id code time
1  2    b    1
2  2    c    2
3  2    b    3
4  5    a    1
5  5    c    2
6  5    a    3

This extends straightforwardly to more grouping variables and fancier sampling rules.
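
As one illustration of a "fancier sampling rule", here is a minimal sketch of drawing ids with replacement (a cluster bootstrap). The toy data frame below has the same shape as the question's dt but uses plain integer and character columns for brevity, and the draw column is a hypothetical label added so that repeated draws of the same id stay distinguishable. Because the sampled ids are joined back rather than filtered, an id drawn twice contributes its full sequence twice. Recent dplyr versions may warn about a many-to-many relationship in this join, which is expected here.

```r
library(dplyr)

# Toy person-period data shaped like the question's `dt`
# (integer/character columns instead of factors: an assumption for brevity).
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

set.seed(1)

# Draw 5 ids WITH replacement; `draw` (hypothetical name) labels each draw.
samp <- dt %>%
  distinct(id) %>%
  sample_n(5, replace = TRUE) %>%
  mutate(draw = row_number())

# The join repeats an id's whole sequence once per time that id was drawn.
boot <- left_join(samp, dt, by = "id")
```

Each value of draw then identifies one complete three-row sequence, so boot has 15 rows no matter which ids happen to be repeated.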

If you need to do this many times...

nrep = 100
ng   = 2
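# stack nrep copies of the id list, label each copy with r, sample ng ids per copy
samps = dt %>% select(id) %>% distinct %>% 
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
# join back to attach the full sequence of every sampled id
repdat = left_join(samps, dt, by = "id")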

# then do stuff with it:
repdat %>% group_by(r) %>% do_stuff

Answer 2:

I imagine you are running simulations and will want to do the subsetting many times. In that case you may also want to try this data.table method, which uses the fast binary search on the key column:

library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = FALSE)

#[[1]]
#   id code time
#1:  3    c    1
#2:  3    c    2
#3:  3    c    3
#4:  5    a    1
#5:  5    c    2
#6:  5    a    3

#[[2]]
#   id code time
#1:  3    c    1
#2:  3    c    2
#3:  3    c    3
#4:  4    c    1
#5:  4    a    2
#6:  4    c    3
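
If the replicates are going to be summarised afterwards, one possible follow-up is to stack the list into a single table with an identifier per replicate. The sketch below assumes toy data shaped like the question's dt; the column name replicate_id and the per-replicate summary (counting rows with code "a") are made up purely for illustration.

```r
library(data.table)

# Toy person-period data shaped like the question's `dt`
# (integer/character columns instead of factors: an assumption for brevity).
dt <- data.table(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)
setkey(dt, id)

set.seed(1)

# 100 replicates, each a keyed join on 2 distinct sampled ids.
reps <- replicate(100, dt[.(sample(unique(dt$id), 2))], simplify = FALSE)

# Stack the replicates; `idcol` adds a column holding the list position.
stacked <- rbindlist(reps, idcol = "replicate_id")

# Illustrative summary: number of "a" rows in each replicate.
summary_by_rep <- stacked[, .(n_a = sum(code == "a")), by = replicate_id]
```

Keeping everything in one long table means any grouped computation can be expressed with by = replicate_id instead of looping over the list.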

Answer 3:

We can use filter with sample:

dt %>%
    filter(id %in% sample(unique(id),2, replace = FALSE))

NOTE: The OP asked for a dplyr method, and this solution does use dplyr.

If we need to replicate, one option would be to use map from purrr:

library(purrr)
dt %>% 
    distinct(id) %>% 
    replicate(2, .) %>%
    map(~sample(., 2, replace=FALSE)) %>%
    map(~filter(dt, id %in% .))
#$id
#  id code time
#1  1    a    1
#2  1    a    2
#3  1    a    3
#4  4    c    1
#5  4    a    2
#6  4    c    3

#$id
#  id code time
#1  4    c    1
#2  4    a    2
#3  4    c    3
#4  5    a    1
#5  5    c    2
#6  5    a    3
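
The same resampling can also be written with base replicate, which is the form the question asked for. This is only a sketch: sample_sequences is a hypothetical helper name, and the toy data frame below mirrors the question's dt with plain integer and character columns.

```r
library(dplyr)

# Toy person-period data shaped like the question's `dt`
# (integer/character columns instead of factors: an assumption for brevity).
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

# Hypothetical helper: keep every row belonging to `n_ids` randomly chosen ids.
sample_sequences <- function(data, n_ids) {
  keep <- sample(unique(data$id), n_ids)
  filter(data, id %in% keep)
}

set.seed(42)

# A list of 100 data frames, each holding 2 complete id-sequences.
resamples <- replicate(100, sample_sequences(dt, 2), simplify = FALSE)
```

Note that filtering with %in% only suits sampling without replacement, because a repeated id would still be kept just once.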

Source: https://stackoverflow.com/questions/38878720/r-sample-and-resample-a-person-period-file

Tags: dplyr, sample