问题
I have a very big dataset with a DateTime
Column containing POSIXct-Values. I need to determine the season (Winter - Summer) based on the DateTime
column. I've created a function which works fine on a small dataset, but crashes when I use it on the large one. Can anybody see my mistake?
I've created 4 functions:
- 3 subfunctions so that I can do logical comparisons and selection using *apply functions
- 1 function to determine the season
Here are thefunctions:
require(lubridate)
# function for logical comparison (to be used in *apply)
greaterOrEqual <- function(x,y){
ifelse(x >= y,T,F)
}
# function for logical comparison (to be used in *apply)
less <- function(x,y){
ifelse(x < y,T,F)
}
# function for logical comparison (to be used in *apply)
selFromLogic <- function(VecLogic,VecValue){
VecValue[VecLogic]
}
# Main Function to determine the season
getTwoSeasons <- function(input.date) {
Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
SummerStart <- Winter1End + 1
SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
Winter2Start <- SummerEnd + 1
Winter2End <- as.POSIXct("2000-12-31 00:00:00", tz = "UTC")
year(input.date) <- year(Winter1Start)
attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
Season_names <- as.factor(c("WinterHalfYear","SummerHalfYear","WinterHalfYear"))
Season_select <- sapply(SeasonStart, greaterOrEqual, x = input.date) & sapply(SeasonsEnd, less, x = input.date)
Season_return <- apply(Season_select,MARGIN = 1,selFromLogic,VecValue = Season_names)
return(Season_return)
}
And here's a way to test the function:
dates <- Sys.time() + seq(0,10000,10)
getTwoSeasons(dates)
I would be thankful for any help, this is driving me crazy!
回答1:
And if you're interested in getting back four seasons, here's code to do that:
library(lubridate)
getSeason <- function(input.date){
numeric.date <- 100*month(input.date)+day(input.date)
## input Seasons upper limits in the form MMDD in the "break =" option:
cuts <- base::cut(numeric.date, breaks = c(0,319,0620,0921,1220,1231))
# rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
levels(cuts) <- c("Winter","Spring","Summer","Fall","Winter")
return(cuts)
}
Unit Test:
getSeason(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
回答2:
I packaged @Lars Arne Jordanger's much more elegant approach into a function:
getTwoSeasons <- function(input.date){
numeric.date <- 100*month(input.date)+day(input.date)
## input Seasons upper limits in the form MMDD in the "break =" option:
cuts <- base::cut(numeric.date, breaks = c(0,415,1015,1231))
# rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
levels(cuts) <- c("Winter", "Summer","Winter")
return(cuts)
}
Testing it on some sample data seems to work fine:
getTwoSeasons(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
回答3:
For completeness, worth noting that lubridate
now has a quarter (and a semester) function. quarter
splits the year into fourths and semester
into halves:
quarter(x, with_year = FALSE, fiscal_start = 1)
semester(x, with_year = FALSE)
For more, see: https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/quarter
回答4:
After several hours of debugging I've found my mistake, and it's quite absurd really:
If a season for a DateTimeValue was not found, apply
returned list
-object instead of a vector
(this was the case when the DateTime value equalled 2000-12-31 00:00:00
). Returning a list created an an overproportional increase in computation time and the described crashes. Here's a the fixed code:
# input date and return 2 season
getTwoSeasons <- function(input.date) {
Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
SummerStart <- Winter1End + 1
SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
Winter2Start <- SummerEnd + 1
Winter2End <- as.POSIXct("2001-01-01 00:00:01", tz = "UTC")
SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
Season_names <- factor(c("WinterHalf","SummerHalf","WinterHalf"))
year(input.date) <- year(Winter1Start)
attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
Season_selectStart <- vapply(X = SeasonStart,function(x,y){x <= input.date},FUN.VALUE = logical(length(input.date)),y = input.date)
Season_selectEnd <- vapply(X = SeasonsEnd,function(x,y){x > input.date},FUN.VALUE = logical(length(input.date)),y = input.date)
Season_selectBoth <- Season_selectStart & Season_selectEnd
Season_return <- apply(Season_selectBoth,MARGIN = 1,function(x,y){y[x]}, y = Season_names)
return(Season_return)
}
The "sub"-functions are now integrated in the main function and two sapply
functions replaced with vapply
.
PS: There is still an issue with the timezone, since c() strips the timezone away. I'll update the code when I fix it.
回答5:
The following strategy can also be used: The basic observation is that
substr
can extract the month and day information we need in order to
decide if it's summer or winter. The idea is then to convert this to
numbers of the form month.date, and the test for being summer then
boils down to having a number larger than 4.15 but smaller than 10.16.
The example below shows how this can be done when a vector of dates first are transformed into the alternative presentation described above, and then a vector that tells if it is summer "TRUE" or winter "FALSE" will be created based on this.
DateTime <- as.POSIXct(x = "2000-01-01 00:00:00",
tz = "UTC") +
(0:1000)*(60*60*24)
DateTime_2 <- as.numeric(paste(
substr(x = DateTime,
start = 6,
stop = 7),
substr(x = DateTime,
start = 9,
stop = 10),
sep = "."))
.season <- (DateTime_2 > 4.15) & (DateTime_2 < 10.16)
来源:https://stackoverflow.com/questions/36502140/determine-season-from-date-using-lubridate-in-r