How do I remove rows based on a range of dates given by values in 2 columns?

Posted by 匆匆过客 on 2021-02-19 05:39:26

Question


I have a data set that includes a range of dates and need to fill in the missing dates in new rows. df1 is an example of the data I am working with and df2 is an example of what I've managed to achieve (where I'm stuck). df3 is where I would like to end up!

df1
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021  
2      1/23/2021  1/23/2021     1/24/2021

This is what I have so far. The NAs aren't an issue because I intend to drop the DateStart and DateEnd columns after doing what I need to do. The problem is that df2 also contains the dates that fall inside an existing DateStart-DateEnd range (for example 2/12/2021 through 2/17/2021 for ID 1), and I don't want those rows. To get here I grouped by ID and filled in the missing dates between the dates in df1:

df2
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/12/2021  NA            NA
1      2/13/2021  NA            NA
1      2/14/2021  NA            NA
1      2/15/2021  NA            NA
1      2/16/2021  NA            NA
1      2/17/2021  NA            NA
1      2/18/2021  NA            NA
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/16/2021  NA            NA
2      1/17/2021  NA            NA
2      1/18/2021  NA            NA
2      1/19/2021  NA            NA
2      1/20/2021  NA            NA
2      1/21/2021  NA            NA
2      1/22/2021  NA            NA    
2      1/23/2021  1/23/2021     1/24/2021  
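
The question does not show the code used to build df2. One way to get a table of that shape is sketched below, assuming tidyr's complete was applied per ID over the Date column; the name df2_sketch and the use of as.Date for parsing are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Same values as the df1 example, stored as m/d/Y strings
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd   = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

df2_sketch <- df1 %>%
  # parse every column except ID into Date objects
  mutate(across(-ID, ~ as.Date(.x, format = "%m/%d/%Y"))) %>%
  group_by(ID) %>%
  # add one row per missing day between the first and last Date of each ID;
  # DateStart and DateEnd are NA on the added rows
  complete(Date = seq(min(Date), max(Date), by = "day")) %>%
  ungroup()
```

This fills every day between the observed dates, including the days already covered by a DateStart-DateEnd range, which is exactly the part that still has to be removed.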

This is actually what I'd like to end up with:

df3
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/18/2021  NA            NA
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/21/2021  NA            NA
2      1/22/2021  NA            NA    
2      1/23/2021  1/23/2021     1/24/2021  

In df3 the missing dates are filled in but not the dates within the DateStart-DateEnd range.

Any thoughts on how to achieve this? Note: I have a dataset with a large number of observations.


Answer 1:


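  • Convert the date columns to Date class.

  • For each ID, use complete to create a daily sequence of dates from the minimum of DateStart to the maximum of DateEnd.

  • Use fill to carry the previous non-NA DateStart and DateEnd down onto the new rows, then set them back to NA where Date > DateEnd (those dates lie in a gap between two ranges).

  • For every group of ID, DateStart and DateEnd, keep the rows with NA values (the gap dates) and row number 1 of each group (the first date of each range).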

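library(dplyr)
library(tidyr)

df %>%
  # parse the m/d/Y strings in every column except ID into Date objects
  mutate(across(-ID, lubridate::mdy)) %>%
  group_by(ID) %>%
  # one row per calendar day from the earliest DateStart to the latest DateEnd
  complete(Date = seq(min(DateStart), max(DateEnd), by = '1 day')) %>%
  # carry each range's DateStart/DateEnd down onto the newly added rows
  fill(DateStart, DateEnd) %>%
  ungroup() %>%
  # dates past the carried-down DateEnd lie in a gap: reset their range to NA
  mutate(across(c(DateStart, DateEnd), ~replace(., Date > DateEnd, NA))) %>%
  group_by(ID, DateStart, DateEnd) %>%
  # keep the gap rows (NA) and only the first row of each range
  filter(is.na(DateStart) | row_number() == 1)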

#     ID Date       DateStart  DateEnd   
#  <int> <date>     <date>     <date>    
#1     1 2021-02-11 2021-02-11 2021-02-17
#2     1 2021-02-18 NA         NA        
#3     1 2021-02-19 2021-02-19 2021-02-21
#4     2 2021-01-15 2021-01-15 2021-01-20
#5     2 2021-01-21 NA         NA        
#6     2 2021-01-22 NA         NA        
#7     2 2021-01-23 2021-01-23 2021-01-24

data

df <- structure(list(ID = c(1L, 1L, 2L, 2L), Date = c("2/11/2021", 
"2/19/2021", "1/15/2021", "1/23/2021"), DateStart = c("2/11/2021", 
"2/19/2021", "1/15/2021", "1/23/2021"), DateEnd = c("2/17/2021", 
"2/21/2021", "1/20/2021", "1/24/2021")), 
class = "data.frame", row.names = c(NA, -4L))
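
Since the question mentions a large number of observations, it may help to avoid creating every in-range date only to filter it out again. The following is a sketch of an alternative, not part of the original answer: it builds only the gap dates between consecutive ranges and appends them to the original rows. It assumes each row's Date equals its DateStart, as in the sample data; the helper columns next_start, n_gap and offset are hypothetical names.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd   = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

ranges <- df %>%
  mutate(across(-ID, ~ as.Date(.x, format = "%m/%d/%Y"))) %>%
  arrange(ID, DateStart)

gap_rows <- ranges %>%
  group_by(ID) %>%
  # start of the following range within the same ID (NA for the last range)
  mutate(next_start = lead(DateStart)) %>%
  ungroup() %>%
  filter(!is.na(next_start)) %>%
  # number of uncovered days between this range and the next (0 if adjacent or overlapping)
  mutate(n_gap = pmax(as.integer(next_start - DateEnd) - 1L, 0L)) %>%
  # repeat each row n_gap times; offset counts 1, 2, ... within each repeat
  uncount(n_gap, .id = "offset") %>%
  # the gap date is offset days after DateEnd; gap rows carry no range
  mutate(Date = DateEnd + offset,
         DateStart = as.Date(NA),
         DateEnd = as.Date(NA)) %>%
  select(ID, Date, DateStart, DateEnd)

result <- bind_rows(ranges, gap_rows) %>%
  arrange(ID, Date)
```

On the sample data, result has the same seven rows as the output shown above. The number of generated rows is proportional to the total gap length rather than to the full date span of each ID.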


Source: https://stackoverflow.com/questions/66152303/how-do-i-remove-rows-based-on-a-range-of-dates-given-by-values-in-2-columns
