问题
I've identified, if not myself created, a difficult bug to resolve in some nice code received from a generous respondent here on StackOverflow a few weeks ago, and I could use some new assistance today.
Sample data (called object eh below):
ID 2013-03-20 2013-04-09 2013-04-11 2013-04-17 2013-04-25 2013-05-15 2013-05-24 2013-05-25 2013-05-26
5167f 0 0 0 0 0 0 0 0 0
1214m 0 0 0 0 0 0 0 0 0
1844f 0 0 0 0 0 0 0 0 0
2113m 0 0 0 0 0 0 0 0 0
2254m 0 0 0 0 0 0 0 0 0
2721f 0 0 0 0 0 0 0 0 0
3121f 0 0 0 0 0 0 0 0 0
3486f 0 0 0 0 0 0 0 0 0
3540f 0 0 0 0 0 0 0 0 0
4175m 0 0 0 0 0 0 0 0 0
I need to be able to group 0s and 1s by the time period in which their respective column date falls (e.g., every 1, 2, 3, or 4 weeks). Whenever a 1 falls at least once within a specific date range (Period), then a 1 is summarized for that ID in that Period (0, else).
I'm starting with the 1-week summary routine as an example. My main problem is that the final output generated lacks some of the total possible 1-week Periods during the time series "2013-03-20" to "2015-12-31".
Notice in this example output, wherein the rows are for unique IDs and columns are for unique Periods, how Periods 2, 5, 7, and 9 are missing:
1 3 4 6 8 10 11 12 13 14
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Here is the full routine for grouping the original data frame (see sample data shared above):
#Convert to data table from original data frame, eh
dt <- as.data.table(eh)
#One week summarized encounter histories
dt_merge <- data_frame(
# Create a column showing the beginning date
Date1 = seq(from = ymd("2013-03-20"), to = ymd("2015-12-31"), by = "1 week")) %>%
# Create a column showing the end date of each period
mutate(Date2 = lead(Date1)) %>%
# Adjust Date1
mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
# Remove the last row
drop_na(Date2) %>%
# Create date list
mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
unnest() %>%
# Create Group ID
mutate(RunID = group_indices_(., dots. = c("Date1", "Date2"))) %>%
# Create Period ID
mutate(Period = paste0(RunID)) %>%
# Add a column showing Month
mutate(Month = month(Dates)) %>%
# Add a column showing Year
mutate(Year = year(Dates)) %>%
# Add a column showing season
mutate(Season = case_when(
Month %in% 3:5 ~ "Spring",
Month %in% 6:8 ~ "Summer",
Month %in% 9:11 ~ "Fall",
Month %in% c(12, 1, 2) ~ "Winter",
TRUE ~ NA_character_
)) %>%
# Combine Season and Year
mutate(SeasonYear = paste0(Season, Year)) %>%
select(-Date1, -Date2, -RunID)
dt2 <- dt %>%
# Reshape the data frame
gather(Date, Value, -ID) %>%
# Convert Date to date class
mutate(Date = ymd(Date)) %>%
# Join dt_merge
left_join(dt_merge, by = c("Date" = "Dates"))
one.week <- dt2 %>%
group_by(ID, Period) %>%
summarise(Value = max(Value)) %>%
spread(Period, Value)
#Finished product
one.week <- as.data.frame(one.week)
#Missing weeks 2, 5, 7, and 9...
one.week
Can someone help me understand where I've gone wrong? Thanks in advance!
-AD
回答1:
This is happening because those weeks are missing from the eh data. For example, if you look at the dates that make up week 2:
dt_merge %>%
filter(Period == 2)
#> # A tibble: 7 x 6
#> Dates Period Month Year Season SeasonYear
#> <date> <chr> <dbl> <dbl> <chr> <chr>
#> 1 2013-03-28 2 3 2013 Spring Spring2013
#> 2 2013-03-29 2 3 2013 Spring Spring2013
#> 3 2013-03-30 2 3 2013 Spring Spring2013
#> 4 2013-03-31 2 3 2013 Spring Spring2013
#> 5 2013-04-01 2 4 2013 Spring Spring2013
#> 6 2013-04-02 2 4 2013 Spring Spring2013
#> 7 2013-04-03 2 4 2013 Spring Spring2013
You can see that none of these dates are in the columns of eh, which skip from 2013-03-20 to 2013-04-09. Because you use a left_join when creating dt2, only dates (and therefore weeks) in eh are retained.
This can be corrected by using complete() from the tidyr package to create the missing combinations of ID and Date.
dt2 <- dt %>%
# Reshape the data frame
gather(Date, Value, -ID) %>%
# Convert Date to date class
mutate(Date = ymd(Date)) %>%
# Create missing ID/Date combinations
complete(ID, Date = dt_merge$Dates) %>%
# Join dt_merge
left_join(dt_merge, by = c("Date" = "Dates"))
one.week <- dt2 %>%
mutate(Period = as.numeric(Period)) %>%
group_by(ID, Period) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
spread(Period, Value)
one.week
#> # A tibble: 10 x 146
#> # Groups: ID [10]
#> ID `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1214m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 2 1844f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 3 2113m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 4 2254m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 5 2721f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 6 3121f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 7 3486f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 8 3540f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 9 4175m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 10 5167f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> # ... with 134 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>,
#> # `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
#> # `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>,
#> # `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#> # `30` <dbl>, `31` <dbl>, `32` <dbl>, `33` <dbl>, `34` <dbl>,
#> # `35` <dbl>, `36` <dbl>, `37` <dbl>, `38` <dbl>, `39` <dbl>,
#> # `40` <dbl>, `41` <dbl>, `42` <dbl>, `43` <dbl>, `44` <dbl>,
#> # `45` <dbl>, `46` <dbl>, `47` <dbl>, `48` <dbl>, `49` <dbl>,
#> # `50` <dbl>, `51` <dbl>, `52` <dbl>, `53` <dbl>, `54` <dbl>,
#> # `55` <dbl>, `56` <dbl>, `57` <dbl>, `58` <dbl>, `59` <dbl>,
#> # `60` <dbl>, `61` <dbl>, `62` <dbl>, `63` <dbl>, `64` <dbl>,
#> # `65` <dbl>, `66` <dbl>, `67` <dbl>, `68` <dbl>, `69` <dbl>,
#> # `70` <dbl>, `71` <dbl>, `72` <dbl>, `73` <dbl>, `74` <dbl>,
#> # `75` <dbl>, `76` <dbl>, `77` <dbl>, `78` <dbl>, `79` <dbl>,
#> # `80` <dbl>, `81` <dbl>, `82` <dbl>, `83` <dbl>, `84` <dbl>,
#> # `85` <dbl>, `86` <dbl>, `87` <dbl>, `88` <dbl>, `89` <dbl>,
#> # `90` <dbl>, `91` <dbl>, `92` <dbl>, `93` <dbl>, `94` <dbl>,
#> # `95` <dbl>, `96` <dbl>, `97` <dbl>, `98` <dbl>, `99` <dbl>,
#> # `100` <dbl>, `101` <dbl>, `102` <dbl>, `103` <dbl>, `104` <dbl>,
#> # `105` <dbl>, `106` <dbl>, `107` <dbl>, `108` <dbl>, `109` <dbl>,
#> # `110` <dbl>, `111` <dbl>, ...
Here -Inf is returned if there were no values for that ID in a given week. Alternatively, instead of filling the missing values with NA, they could be filled with, for example 0, using complete(ID, Date = dt_merge$Dates, fill = list(Value = 0)). This will make the Value variable 0 for any of the unobserved ID and Date combinations.
来源:https://stackoverflow.com/questions/45983806/grouping-a-data-frame-by-dates-resolve-missing-time-periods-bug