How do I use tidyr to fill in completed rows within each value of a grouping variable?

孤者浪人 submitted on 2019-11-29 13:58:07

You can "group" variables inside complete() by wrapping them in c(). complete() then fills in rows using only the combinations of the grouped variables that already exist in the data.

library(tidyr)
dat %>% complete(c(id, c, d), choice) 
     id     c          d choice
  (int) (dbl)     (time)  (int)
1     1     9       <NA>      5
2     1     9       <NA>      6
3     1     9       <NA>      7
4     2    NA 2015-09-30      5
5     2    NA 2015-09-30      6
6     2    NA 2015-09-30      7
7     3    11 2015-09-29      5
8     3    11 2015-09-29      6
9     3    11 2015-09-29      7

As an update to @jeremycg's answer: from tidyr 0.5.1 (or maybe even 0.4.0) onwards, c() no longer works inside complete(). Use nesting() instead:

dat %>% 
 complete(nesting(id, c, d), choice) 
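
Since the original dat is not shown here, the following is a minimal sketch of the same nesting() call on made-up data. The data frame toy and its values are assumptions for illustration only; complete() and nesting() are the tidyr functions used above.

```r
library(tidyr)

# Hypothetical toy data (not the original `dat`):
# one row per observed (id, choice) pair, with c constant within each id
toy <- data.frame(
  id     = c(1, 1, 2, 3),
  c      = c(9, 9, NA, 11),
  choice = c(5, 6, 5, 7)
)

# nesting(id, c) keeps only the (id, c) pairs that occur in the data;
# complete() then crosses those pairs with every observed value of choice
filled <- complete(toy, nesting(id, c), choice)

# 3 existing (id, c) pairs x 3 choice values
nrow(filled)
```

Each id ends up with one row per choice value, and c stays attached to its own id instead of being crossed with the other ids.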

Note: I tried to edit @jeremycg's answer, since it was correct at the time it was written (so a new answer is not really necessary), but unfortunately the edit was rejected.

I think you're better off keeping the data separate while you prepare it, and then merging before you need to do the regression.

subjectdata <- dat[,c("id", "c", "d")]
questiondata <- dat[,c("id", "choice")] %>% complete(id, choice)

And then

> merge(questiondata, subjectdata)
  id choice  c          d
1  1      5  9       <NA>
2  1      6  9       <NA>
3  1      7  9       <NA>
4  2      5 NA 2015-09-30
5  2      6 NA 2015-09-30
6  2      7 NA 2015-09-30
7  3      5 11 2015-09-29
8  3      6 11 2015-09-29
9  3      7 11 2015-09-29

as necessary. That way you also get a valid d column for user 2, without relying on the order of questions in the data frame.
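
The split-then-merge idea can be sketched on made-up data as follows. The names toy, subject, question and merged are hypothetical, and using unique() to reduce the subject-level columns to one row per id is an assumption that only holds when those columns are constant within each id.

```r
library(tidyr)

# Hypothetical toy data (not the original `dat`)
toy <- data.frame(
  id     = c(1, 1, 2, 3),
  c      = c(9, 9, NA, 11),
  choice = c(5, 6, 5, 7)
)

# Subject-level columns: one row per id
# (assumes c never varies within an id)
subject <- unique(toy[, c("id", "c")])

# Question-level columns: every id crossed with every observed choice
question <- complete(toy[, c("id", "choice")], id, choice)

# merge() joins on the shared column, here id
merged <- merge(question, subject)

nrow(merged)
```

Keeping the subject-level table at one row per id means the merge cannot duplicate question rows.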

Another approach is to use spread() and gather(): spread() creates one column per possible answer, and gather() reshapes those columns back into rows. With these data:

dat %>%
  spread(choice, choice) %>%
  gather(choice, drop_me, `5`:`7`) %>%  # Drop me is a redundant column
  select(-drop_me) %>%
  arrange(id, choice)  # reorders so that the answer matches

#   id  c          d choice
# 1  1  9       <NA>      5
# 2  1  9       <NA>      6
# 3  1  9       <NA>      7
# 4  2 NA 2015-09-30      5
# 5  2 NA 2015-09-30      6
# 6  2 NA 2015-09-30      7
# 7  3 11 2015-09-29      5
# 8  3 11 2015-09-29      6
# 9  3 11 2015-09-29      7

I haven't done any testing to see how these compare in efficiency.
