Grouping a data frame by dates: resolve missing time periods' bug

问题

I've identified, if not myself created, a difficult bug to resolve in some nice code received from a generous respondent here on StackOverflow a few weeks ago, and I could use some new assistance today.

Sample data (called object eh below):

    ID        2013-03-20 2013-04-09 2013-04-11 2013-04-17 2013-04-25 2013-05-15 2013-05-24 2013-05-25 2013-05-26
    5167f          0          0          0          0          0          0          0          0          0
    1214m          0          0          0          0          0          0          0          0          0
    1844f          0          0          0          0          0          0          0          0          0
    2113m          0          0          0          0          0          0          0          0          0
    2254m          0          0          0          0          0          0          0          0          0
    2721f          0          0          0          0          0          0          0          0          0
    3121f          0          0          0          0          0          0          0          0          0
    3486f          0          0          0          0          0          0          0          0          0
    3540f          0          0          0          0          0          0          0          0          0
    4175m          0          0          0          0          0          0          0          0          0

I need to be able to group 0s and 1s by the time period in which their respective column date falls (e.g., every 1, 2, 3, or 4 weeks). Whenever a 1 falls at least once within a specific date range (Period), then a 1 is summarized for that ID in that Period (0, else).

I'm starting with the 1-week summary routine as an example. My main problem is that the final output generated lacks some of the total possible 1-week Periods during the time series "2013-03-20" to "2015-12-31".

Notice in this example output, wherein the rows are for unique IDs and columns are for unique Periods, how Periods 2, 5, 7, and 9 are missing:

    1   3   4   6   8   10  11  12  13  14
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0

Here is the full routine for grouping the original data frame (see sample data shared above):

    #Convert to data table from original data frame, eh
    dt <- as.data.table(eh)

    #One week summarized encounter histories
    dt_merge <- data_frame(
      # Create a column showing the beginning date
      Date1 = seq(from = ymd("2013-03-20"), to = ymd("2015-12-31"), by = "1 week")) %>%
      # Create  a column showing the end date of each period
      mutate(Date2 = lead(Date1)) %>%
      # Adjust Date1
      mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
      # Remove the last row
      drop_na(Date2) %>%
      # Create date list
      mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
      unnest() %>%
      # Create Group ID
      mutate(RunID = group_indices_(., dots. = c("Date1", "Date2"))) %>%
      # Create Period ID
      mutate(Period = paste0(RunID)) %>%
      # Add a column showing Month
      mutate(Month = month(Dates)) %>%
      # Add a column showing Year
      mutate(Year = year(Dates)) %>%
      # Add a column showing season
      mutate(Season = case_when(
        Month %in% 3:5            ~ "Spring",
        Month %in% 6:8            ~ "Summer",
        Month %in% 9:11           ~ "Fall",
        Month %in% c(12, 1, 2)    ~ "Winter",
        TRUE                      ~ NA_character_
      )) %>%
      # Combine Season and Year
      mutate(SeasonYear = paste0(Season, Year)) %>%
      select(-Date1, -Date2, -RunID)
    dt2 <- dt %>%
      # Reshape the data frame
      gather(Date, Value, -ID) %>%
      # Convert Date to date class
      mutate(Date = ymd(Date)) %>%
      # Join dt_merge
      left_join(dt_merge, by = c("Date" = "Dates")) 
    one.week <- dt2 %>%
      group_by(ID, Period) %>%
      summarise(Value = max(Value)) %>%
      spread(Period, Value)

    #Finished product
    one.week <- as.data.frame(one.week)

    #Missing weeks 2, 5, 7, and 9...
    one.week

Can someone help me understand where I've gone wrong? Thanks in advance!

-AD

回答1:

This is happening because those weeks are missing from the eh data. For example, if you look at the dates that make up week 2:

dt_merge %>%
  filter(Period == 2)
#> # A tibble: 7 x 6
#>        Dates Period Month  Year Season SeasonYear
#>       <date>  <chr> <dbl> <dbl>  <chr>      <chr>
#> 1 2013-03-28      2     3  2013 Spring Spring2013
#> 2 2013-03-29      2     3  2013 Spring Spring2013
#> 3 2013-03-30      2     3  2013 Spring Spring2013
#> 4 2013-03-31      2     3  2013 Spring Spring2013
#> 5 2013-04-01      2     4  2013 Spring Spring2013
#> 6 2013-04-02      2     4  2013 Spring Spring2013
#> 7 2013-04-03      2     4  2013 Spring Spring2013

You can see that none of these dates are in the columns of eh, which skip from 2013-03-20 to 2013-04-09. Because you use a left_join when creating dt2, only dates (and therefore weeks) in eh are retained.

This can be corrected by using complete() from the tidyr package to create the missing combinations of ID and Date.

dt2 <- dt %>%
  # Reshape the data frame
  gather(Date, Value, -ID) %>%
  # Convert Date to date class
  mutate(Date = ymd(Date)) %>%
  # Create missing ID/Date combinations
  complete(ID, Date = dt_merge$Dates) %>%
  # Join dt_merge
  left_join(dt_merge, by = c("Date" = "Dates"))
one.week <- dt2 %>%
  mutate(Period = as.numeric(Period)) %>%
  group_by(ID, Period) %>%
  summarise(Value = max(Value, na.rm = TRUE)) %>%
  spread(Period, Value)
one.week
#> # A tibble: 10 x 146
#> # Groups:   ID [10]
#>       ID   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`
#>  * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 1214m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  2 1844f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  3 2113m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  4 2254m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  5 2721f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  6 3121f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  7 3486f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  8 3540f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  9 4175m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#> 10 5167f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#> # ... with 134 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>,
#> #   `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
#> #   `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>,
#> #   `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#> #   `30` <dbl>, `31` <dbl>, `32` <dbl>, `33` <dbl>, `34` <dbl>,
#> #   `35` <dbl>, `36` <dbl>, `37` <dbl>, `38` <dbl>, `39` <dbl>,
#> #   `40` <dbl>, `41` <dbl>, `42` <dbl>, `43` <dbl>, `44` <dbl>,
#> #   `45` <dbl>, `46` <dbl>, `47` <dbl>, `48` <dbl>, `49` <dbl>,
#> #   `50` <dbl>, `51` <dbl>, `52` <dbl>, `53` <dbl>, `54` <dbl>,
#> #   `55` <dbl>, `56` <dbl>, `57` <dbl>, `58` <dbl>, `59` <dbl>,
#> #   `60` <dbl>, `61` <dbl>, `62` <dbl>, `63` <dbl>, `64` <dbl>,
#> #   `65` <dbl>, `66` <dbl>, `67` <dbl>, `68` <dbl>, `69` <dbl>,
#> #   `70` <dbl>, `71` <dbl>, `72` <dbl>, `73` <dbl>, `74` <dbl>,
#> #   `75` <dbl>, `76` <dbl>, `77` <dbl>, `78` <dbl>, `79` <dbl>,
#> #   `80` <dbl>, `81` <dbl>, `82` <dbl>, `83` <dbl>, `84` <dbl>,
#> #   `85` <dbl>, `86` <dbl>, `87` <dbl>, `88` <dbl>, `89` <dbl>,
#> #   `90` <dbl>, `91` <dbl>, `92` <dbl>, `93` <dbl>, `94` <dbl>,
#> #   `95` <dbl>, `96` <dbl>, `97` <dbl>, `98` <dbl>, `99` <dbl>,
#> #   `100` <dbl>, `101` <dbl>, `102` <dbl>, `103` <dbl>, `104` <dbl>,
#> #   `105` <dbl>, `106` <dbl>, `107` <dbl>, `108` <dbl>, `109` <dbl>,
#> #   `110` <dbl>, `111` <dbl>, ...

Here -Inf is returned if there were no values for that ID in a given week. Alternatively, instead of filling the missing values with NA, they could be filled with, for example 0, using complete(ID, Date = dt_merge$Dates, fill = list(Value = 0)). This will make the Value variable 0 for any of the unobserved ID and Date combinations.

来源：https://stackoverflow.com/questions/45983806/grouping-a-data-frame-by-dates-resolve-missing-time-periods-bug

标签

dataframe

dplyr

lubridate

tidyverse