Outputting various subsets from one data frame based on dates

Question

I want to create numerous subsets of data based on date ranges defined in a separate dataframe. For example, one dataframe has dates and daily recorded values across multiple years; I have created a hypothetical one below. I want to take several subsets of this dataframe based on start and end dates defined elsewhere.

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:3000, 300*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(seq(as.Date("2004/1/1"), by = "day", length.out = 3000))
Example <- cbind(df1,df2)

The start and end dates span the year before a particular sample. So if I sampled on 18/05/2006, I would want all values between 17/05/2005 and 17/05/2006. I have created an example series of dates below using the lubridate package.

library(lubridate)
Sample_dates <- as.data.frame(dmy(c("18/05/2006","07/05/2010","01/04/2011",
         "26/10/2006","24/09/2010","27/09/2011")))
End_dates <- (Sample_dates)-days(1) 
Start_dates <- (End_dates)-years(1)
Sequence_dates <- cbind(Start_dates,End_dates)
colnames(Sequence_dates) <- c("Startdates", "Enddates")

This should give 6 subsets of the original dataframe (Example), one per date range in the second dataframe (Sequence_dates). In reality there are many more sample dates, so code that handles all the start and end dates in one place would be preferable to selecting each start and end date manually. A loop seems a strong possibility, and I tried the following, based on a similar (more complex) post found elsewhere, "For() loop to ID dates that are between others and calculate a mean value".

for (i in 1:nrow(Sequence_dates)){
Selected_dates[i] = is.between(Sequence_dates$Startdates[i], Discharge_dates$Enddates[i])
}

However, R does not recognise is.between, and I appreciate the code may be sloppy as I have never written a loop before. Any help on this would be much appreciated!

James

Answer 1:

I might do it as follows.

Only end dates seem to be necessary as start dates are just 1 year before.

The loop is done with lapply(), which iterates over all end dates.

Subsetting is done with difftime(): a row is kept when its date is on or after the start date (non-negative difference) and on or before the end date (non-positive difference).
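
To illustrate how the sign of difftime() can act as a between-dates test, here is a minimal sketch on four hand-picked dates. The dates and the names d, start, end and keep are made up for this illustration and are not part of the original post.

```r
# Four illustrative dates: one just before the window, the two boundaries,
# and one just after the window.
d     <- as.Date(c("2005-05-16", "2005-05-17", "2006-05-17", "2006-05-18"))
start <- as.Date("2005-05-17")
end   <- as.Date("2006-05-17")

# difftime(a, b) is a - b, so ">= 0" means "on or after b"
# and "<= 0" means "on or before b".
keep <- difftime(d, start) >= 0 & difftime(d, end) <= 0
keep
# [1] FALSE  TRUE  TRUE FALSE
```

Both boundary dates are kept, which matches the inclusive range asked for in the question.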

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:3000, 300*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(seq(as.Date("2004/1/1"), by = "day", length.out = 3000))

df <- data.frame(df1, df2)
names(df) <- c("val", "date")

library(lubridate)
ends <- c(dmy(c("18/05/2006","07/05/2010","01/04/2011","26/10/2006","24/09/2010","27/09/2011"))) - days(1)

subs <- lapply(ends, function(x) {
    df[difftime(df$date, x - years(1)) >= 0 & difftime(df$date, x) <= 0, ]
})

length(subs)
# [1] 6
min(subs[[1]]$date)
# [1] "2005-05-17"
max(subs[[1]]$date)
# [1] "2006-05-17"
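
As a possible variation, the same subsets can be sketched in base R only, comparing dates directly with >= and <= instead of going through difftime(). This is an illustration under assumptions, not the answerer's method: the names samples and subs2 are hypothetical, the one-year offset uses seq() with by = "-1 year" in place of lubridate's years(1), and labelling each subset by its sample date is an added convenience.

```r
# Rebuild the example data with the column names used in the answer.
set.seed(24)
df <- data.frame(
  val  = sample(0:3000, 3000, replace = TRUE),
  date = seq(as.Date("2004-01-01"), by = "day", length.out = 3000)
)

# Sample dates from the question, parsed with base R instead of lubridate.
samples <- as.Date(c("18/05/2006", "07/05/2010", "01/04/2011",
                     "26/10/2006", "24/09/2010", "27/09/2011"),
                   format = "%d/%m/%Y")

# One subset per sample date: from one year before the end date up to the
# day before sampling, both ends inclusive.
subs2 <- lapply(seq_along(samples), function(i) {
  end   <- samples[i] - 1
  start <- seq(end, by = "-1 year", length.out = 2)[2]
  df[df$date >= start & df$date <= end, ]
})

# Hypothetical naming choice: label each subset by its sample date.
names(subs2) <- format(samples)

range(subs2[["2006-05-18"]]$date)
# [1] "2005-05-17" "2006-05-17"
```

Naming the list elements makes it easier to look up the subset that belongs to a given sample, for example subs2[["2010-05-07"]].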

Source: https://stackoverflow.com/questions/30040881/outputting-various-subsets-from-one-data-frame-based-on-dates

Tags

loops

subset

lubridate