Add rows in data frame if observations are missing [duplicate]

Question

I have a data frame df1 with multiple questionnaires (measure) per person (id), each answered at a particular point in time (date). Normally every person fills out three questionnaires per session (first, pre, post), but some participants answer only one or two of the three. The possible patterns are therefore: complete (participant A), missing “post” (participant B), missing “first” (participant C), missing “pre” (participant D), or only one of the three answered (participants E, F, G).

See df1:

df1 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,  4L, 5L, 6L, 7L), .Label = c("A", "B", "C", "D", "E", "F", "G"), class = "factor"), measure = structure(c(1L, 3L, 2L, 1L, 3L, 3L, 2L, 1L, 2L, 1L, 3L, 2L), .Label = c("first", "post", "pre"), class = "factor"), date = structure(c(17558, 17558, 17558,  17558, 17559, 17559, 17559, 17559, 17558, 17558, 17558, 17558 ), class = "Date"), result = c(1, 5, 4, 7, 8, 7, 2, 1, 3, 5, 7, 7)), class = "data.frame", row.names = c(NA, -12L))

Now I would like to add the missing rows to the dataset, keeping id and measure and setting date and result to NA for them. The final data frame should look like df2.

df2 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("A", "B", "C", "D", "E", "F", "G"), class = "factor"), measure = structure(c(1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L), .Label = c("first", "post", "pre"), class = "factor"), date = structure(c(17558, 17558, 17558, 17558, 17559, NA, NA, 17559, 17559, 17559, NA, 17558, 17558, NA, NA, NA, 17558, NA, NA, NA, 17558), class = "Date"), result = c(1, 5, 4, 7, 8, NA, NA, 7, 2, 1, NA, 3, 5, NA, NA, NA, 7, NA, NA, NA, 7)), class = "data.frame", row.names = c(NA, -21L))

I tried to group_by the combinations that could be missing and insert a row, but this did not produce the desired result.

library(tidyverse)
final <- df1 %>%
  group_by(id, measure == "first" & lag(measure, 1, default = NA) == "post") %>%
  do(add_row(., measure = "pre", .after = 0)) %>%
  ungroup()

I also tried

final <- df1 %>% complete(id, nesting(measure, date))

What perhaps makes this more complicated is that participants can take part in more than one session, so each id may have x * (first, post, pre) rows.

Answer 1:

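This can be done with complete(df1, id, measure) from tidyr, which adds a row for every combination of id and measure that is not yet in the data and fills the remaining columns (date, result) with NA. Try this: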

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

df1 <- structure(list(
  id = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,  4L, 5L, 6L, 7L), 
                 .Label = c("A", "B", "C", "D", "E", "F", "G"), 
                 class = "factor"), 
  measure = structure(c(1L, 3L, 2L, 1L, 3L, 3L, 2L, 1L, 2L, 1L, 3L, 2L), 
                      .Label = c("first", "post", "pre"), 
                      class = "factor"), 
  date = structure(c(17558, 17558, 17558,  17558, 17559, 17559, 17559, 17559, 17558, 17558, 17558, 17558 ), class = "Date"), 
  result = c(1, 5, 4, 7, 8, 7, 2, 1, 3, 5, 7, 7)), class = "data.frame", row.names = c(NA, -12L))

df2 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("A", "B", "C", "D", "E", "F", "G"), class = "factor"), measure = structure(c(1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L), .Label = c("first", "post", "pre"), class = "factor"), date = structure(c(17558, 17558, 17558, 17558, 17559, NA, NA, 17559, 17559, 17559, NA, 17558, 17558, NA, NA, NA, 17558, NA, NA, NA, 17558), class = "Date"), result = c(1, 5, 4, 7, 8, NA, NA, 7, 2, 1, NA, 3, 5, NA, NA, NA, 7, NA, NA, NA, 7)), class = "data.frame", row.names = c(NA, -21L))

# Result with complete(df1, id, measure) and setting order of measure
complete(df1, id, measure) %>% 
  mutate(measure = factor(measure, levels = c("first", "pre", "post"))) %>% 
  arrange(id, measure, date) %>% 
  as.data.frame()
#>    id measure       date result
#> 1   A   first 2018-01-27      1
#> 2   A     pre 2018-01-27      5
#> 3   A    post 2018-01-27      4
#> 4   B   first 2018-01-27      7
#> 5   B     pre 2018-01-28      8
#> 6   B    post       <NA>     NA
#> 7   C   first       <NA>     NA
#> 8   C     pre 2018-01-28      7
#> 9   C    post 2018-01-28      2
#> 10  D   first 2018-01-28      1
#> 11  D     pre       <NA>     NA
#> 12  D    post 2018-01-27      3
#> 13  E   first 2018-01-27      5
#> 14  E     pre       <NA>     NA
#> 15  E    post       <NA>     NA
#> 16  F   first       <NA>     NA
#> 17  F     pre 2018-01-27      7
#> 18  F    post       <NA>     NA
#> 19  G   first       <NA>     NA
#> 20  G     pre       <NA>     NA
#> 21  G    post 2018-01-27      7

# Desired output
df2 %>% 
  mutate(measure = factor(measure, levels = c("first", "pre", "post"))) %>% 
  arrange(id, measure, date)
#>    id measure       date result
#> 1   A   first 2018-01-27      1
#> 2   A     pre 2018-01-27      5
#> 3   A    post 2018-01-27      4
#> 4   B   first 2018-01-27      7
#> 5   B     pre 2018-01-28      8
#> 6   B    post       <NA>     NA
#> 7   C   first       <NA>     NA
#> 8   C     pre 2018-01-28      7
#> 9   C    post 2018-01-28      2
#> 10  D   first 2018-01-28      1
#> 11  D     pre       <NA>     NA
#> 12  D    post 2018-01-27      3
#> 13  E   first 2018-01-27      5
#> 14  E     pre       <NA>     NA
#> 15  E    post       <NA>     NA
#> 16  F   first       <NA>     NA
#> 17  F     pre 2018-01-27      7
#> 18  F    post       <NA>     NA
#> 19  G   first       <NA>     NA
#> 20  G     pre       <NA>     NA
#> 21  G    post 2018-01-27      7

Created on 2020-03-09 by the reprex package (v0.3.0)
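
The question also mentions that a participant may take part in more than one session. The example data has no column that identifies a session, so the following is only a sketch under an assumption: it supposes a hypothetical session column (and a small made-up data frame df_sessions, not the original df1). With such a column, nesting(id, session) restricts the expansion to id/session pairs that actually occur, and each of those pairs is then crossed with every level of measure.

```r
library(dplyr)
library(tidyr)

# Hypothetical data for illustration only: `session` is an assumed column
# that does not exist in the original df1.
df_sessions <- data.frame(
  id      = c("A", "A", "A", "A", "B", "B"),
  session = c(1, 1, 1, 2, 1, 1),
  measure = factor(c("first", "pre", "post", "pre", "first", "post"),
                   levels = c("first", "pre", "post")),
  date    = as.Date(c("2018-01-27", "2018-01-27", "2018-01-27",
                      "2018-02-03", "2018-01-27", "2018-01-28")),
  result  = c(1, 5, 4, 6, 7, 8)
)

# nesting(id, session) keeps only the id/session pairs present in the data;
# every such pair is then combined with all three levels of measure.
# Rows created this way get NA for date and result.
completed <- df_sessions %>%
  complete(nesting(id, session), measure) %>%
  arrange(id, session, measure) %>%
  as.data.frame()

completed
```

Here participant A has two sessions and participant B has one, so the result has three rows per observed id/session pair, and no rows are invented for a second session of B. Without some session identifier in the data, complete(df1, id, measure) can only fill in measures that are missing for an id altogether.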

Source: https://stackoverflow.com/questions/60592894/add-rows-in-data-frame-if-observations-are-missing

Tags

tidyverse