How do I use tidyr to fill in completed rows within each value of a grouping variable?

Submitted by 孤街浪徒 on 2019-11-28 03:31:35

Question


Say I have data on people who choose between several options. I have one row per person, and I want to have one row per person and choice option. So, if I have 10 people who have 3 choices, right now I have 10 rows, and I want to have 30.

All of the other variables should be copied to each of the new rows. So, for example, if I have a variable for gender, that should be constant within ID. (I am setting my data up this way to analyze with mnlogit.)

This seems like the situation that two tidyr functions, complete and fill, were designed for. To use a simple example:

library(lubridate)
library(tidyr)
dat <- data.frame(
    id = 1:3,
    choice = 5:7,
    c = c(9, NA, 11),
    d = ymd(NA, "2015-09-30", "2015-09-29")
    )

dat %>% 
  complete(id, choice) %>%
  fill(everything())

# Source: local data frame [9 x 4]
# 
#      id choice     c          d
#   (int)  (int) (dbl)     (time)
# 1     1      5     9       <NA>
# 2     1      6     9       <NA>
# 3     1      7     9       <NA>
# 4     2      5     9       <NA>
# 5     2      6     9 2015-09-30
# 6     2      7     9 2015-09-30
# 7     3      5     9 2015-09-30
# 8     3      6     9 2015-09-30
# 9     3      7    11 2015-09-29

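But this has some problems: fill() carries values down the whole column without regard to id. The value of c from ID 1 replaced the (correct) NA values for ID 2, and the dates in d also crossed ID boundaries (the first two rows for ID 3 picked up the date from ID 2).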
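To see where the carried-over values come from, it can help to look at the result of complete() on its own, before fill() runs. The sketch below rebuilds the example data with base as.Date() instead of lubridate (a substitution made here only to keep the example self-contained); the names completed and filled are just illustrative.

```r
library(tidyr)

# Same example data as above, using base as.Date() instead of lubridate::ymd()
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# complete() adds the missing id/choice combinations; the new rows get NA
# in every other column, so they look just like the genuine NA for id 2
completed <- complete(dat, id, choice)

# fill() then carries the last non-missing value downward, ignoring id
filled <- fill(completed, everything())

completed$c  # mostly NA at this point
filled$c     # the 9 from id 1 has been carried into the rows for id 2 and id 3
```

After complete(), the NA that belongs to id 2 cannot be told apart from the NAs that complete() introduced, which is why an ungrouped fill() overwrites it.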

I could try a workaround, like replacing all of the missing values with 999, running complete and fill, and then replacing 999 with NA. (I think I would have to convert the date variables to character variables and then convert them back again if I go this route.) But maybe someone on here knows of a tidy way to do this with tidyr?

Edit: the desired output here is:

# Source: local data frame [9 x 4]
# 
#     id     c          d choice
#  (int) (dbl)     (time)  (int)
# 1     1     9       <NA>      5
# 2     1     9       <NA>      6
# 3     1     9       <NA>      7
# 4     2    NA 2015-09-30      5
# 5     2    NA 2015-09-30      6
# 6     2    NA 2015-09-30      7
# 7     3    11 2015-09-29      5
# 8     3    11 2015-09-29      6
# 9     3    11 2015-09-29      7

Answer 1:


You can "group" variables inside complete() by wrapping them in c(). The completion then uses only the combinations of the grouped variables that already exist in the data.

library(tidyr)
dat %>% complete(c(id, c, d), choice) 
     id     c          d choice
  (int) (dbl)     (time)  (int)
1     1     9       <NA>      5
2     1     9       <NA>      6
3     1     9       <NA>      7
4     2    NA 2015-09-30      5
5     2    NA 2015-09-30      6
6     2    NA 2015-09-30      7
7     3    11 2015-09-29      5
8     3    11 2015-09-29      6
9     3    11 2015-09-29      7



Answer 2:


As an update to @jeremycg's answer: from tidyr 0.5.1 (or maybe even version 0.4.0) onwards, c() no longer works inside complete(). Use nesting() instead:

dat %>% 
 complete(nesting(id, c, d), choice) 

Note: I tried to edit @jeremycg's answer, since it was correct at the time it was written (so a new answer is not really necessary), but unfortunately the edit was rejected.
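For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the example data from the question, rebuilt with base as.Date() so that lubridate is not needed; the name result and the checks at the end are only illustrative.

```r
library(tidyr)

# Example data from the question, using base as.Date() instead of lubridate::ymd()
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# nesting() keeps only the id/c/d combinations that already occur in dat,
# then complete() crosses those combinations with every value of choice
result <- complete(dat, nesting(id, c, d), choice)

# One row per person and choice, and the NA in c for id 2 is preserved
stopifnot(nrow(result) == 9)
stopifnot(all(is.na(result$c[result$id == 2])))
```

Because c and d travel together with id inside nesting(), no fill() step is needed afterwards.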




Answer 3:


I think you're better off keeping the data separate while you prepare it, and then merging before you need to do the regression.

subjectdata <- dat[,c("id", "c", "d")]
questiondata <- dat[,c("id", "choice")] %>% complete(id, choice)

Then merge the two whenever you need the combined data:

> merge(questiondata, subjectdata)
  id choice  c          d
1  1      5  9       <NA>
2  1      6  9       <NA>
3  1      7  9       <NA>
4  2      5 NA 2015-09-30
5  2      6 NA 2015-09-30
6  2      7 NA 2015-09-30
7  3      5 11 2015-09-29
8  3      6 11 2015-09-29
9  3      7 11 2015-09-29

That way you also get a valid d column for user 2, without relying on the order of the rows in the data frame.
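The same split-then-join idea can be sketched with tidyr::crossing() and dplyr::left_join() in place of complete() and merge(). This is a variation for illustration rather than the code from this answer; the names subject_data, choice_grid and result are made up here, and the data is rebuilt with base as.Date() so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data from the question, using base as.Date() instead of lubridate::ymd()
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# Per-person variables: one row per id
subject_data <- dat[, c("id", "c", "d")]

# Every id paired with every choice option
choice_grid <- crossing(id = dat$id, choice = dat$choice)

# Attach the per-person variables to each id/choice row
result <- left_join(choice_grid, subject_data, by = "id")
```

Joining explicitly on id makes the key visible, whereas merge() without a by argument joins on whatever column names the two tables happen to share.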




Answer 4:


It looks like another approach is to use spread and gather. spread creates one column per possible answer, and gather takes the separate columns and reshapes them into rows. With these data:

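library(dplyr)  # select() and arrange() come from dplyr
library(tidyr)

dat %>%
  spread(choice, choice) %>%            # one column per choice value: `5`, `6`, `7`
  gather(choice, drop_me, `5`:`7`) %>%  # back to long form; drop_me is a redundant column
  select(-drop_me) %>%
  arrange(id, choice)  # reorders so that the output matches the desired result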

#   id  c          d choice
# 1  1  9       <NA>      5
# 2  1  9       <NA>      6
# 3  1  9       <NA>      7
# 4  2 NA 2015-09-30      5
# 5  2 NA 2015-09-30      6
# 6  2 NA 2015-09-30      7
# 7  3 11 2015-09-29      5
# 8  3 11 2015-09-29      6
# 9  3 11 2015-09-29      7

I haven't done any testing to see how these compare in efficiency.



Source: https://stackoverflow.com/questions/32874239/how-do-i-use-tidyr-to-fill-in-completed-rows-within-each-value-of-a-grouping-var
