Aggregating mean “%H%M” in “week” bins in R

问题

I have been struggling with this for a while. I am new to working with ts data and all related R packages. I have a df with several variables including what 'time of day'in GMT "%H%M" and date "%Y/%m/%e" sampling occurred. I want to bin/aggregate my date data into "weeks" (i.e., %W/%g) and calculate the mean 'time of the day' when sampling occurred during that week.

I was able to calculate other FUN on numerical variables (e.g., weight) by first transforming my df into a zoo object and then using aggregate.zoo command as follow:

#calculate the sum weight captured every week 
x2c <- aggregate(OA_zoo, as.Date(cut(time(OA_zoo), "week")), sum)

However, I am not sure how to get around the fact that I am working with Date format rather than num and would appreciate any tips! Also, I have obviously been coding way to much by doing each of my variables separately. Would there be a way of applying different FUN (sum/mean/max/min) on my df by aggregating "weekly" using plyr? Or some other packages?

EDITS/CLARIFICATIONS Here's the dput output of a sample of my full dataset. I have data from 2004-2011. What I would like to look at/plot using ggplot2 is the mean/median of TIME (%H%M) aggregated in period of weeks over time (2004-2011). Right now, my data is not aggregated in week, but is daily (random sample).

> dput(godin)
structure(list(depth = c(878, 1200, 1170, 936, 942, 964, 951, 
953, 911, 969, 960, 987, 991, 997, 1024, 978, 1024, 951, 984, 
931, 1006, 929, 973, 986, 935, 989, 1042, 1015, 914, 984), duration = c(0.8, 
2.6, 6.5, 3.2, 4.1, 6.4, 7.2, 5.3, 7.4, 7, 7, 5.5, 7.5, 7.3, 
7.5, 7, 4.2, 3, 5, 5, 9.3, 7.9, 7.3, 7.2, 7, 5.2, 8, 6, 7.5, 
7), Greenland = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 40L, 28L, 0L, 
0L, 34L, 7L, 28L, 0L, 0L, 0L, 27L, 0L, 0L, 0L, 44L, 59L, 0L, 
0L, 0L, 0L, 0L, 0L), date2 = structure(c(12617, 12627, 12631, 
12996, 12669, 13036, 12669, 13036, 12670, 13036, 12670, 13037, 
12671, 13037, 12671, 13037, 12671, 13038, 12672, 13038, 12672, 
13038, 12672, 13039, 12631, 12997, 12673, 13039, 12673, 13039
), class = "Date"), TIME = c("0940", "0145", "0945", "2045", 
"1615", "0310", "2130", "1045", "0625", "1830", "1520", "0630", 
"0035", "1330", "0930", "2215", "2010", "0645", "0155", "1205", 
"0815", "1845", "2115", "0350", "1745", "0410", "0550", "1345", 
"1515", "2115")), .Names = c("depth", "duration", "Greenland", 
"date2", "TIME"), class = "data.frame", row.names = c("6761", 
"9019", "9020", "9021", "9022", "9023", "9024", "9025", "9026", 
"9027", "9028", "9029", "9030", "9031", "9032", "9033", "9034", 
"9035", "9036", "9037", "9038", "9039", "9040", "9041", "9042", 
"9043", "9044", "9045", "9046", "9047"))

回答1:

I'd approach it like this: first make a column with a string representing the week:

godin$week <- format(godin$date2, "%Y-W%U")

this will give you something like "2004-W26", which will be good enough for aggregate.

then you need to turn your character vector that represents HHMM into an actual time, so that you can use time math on it.

godin$time2 <- as.POSIXct(strptime(godin$TIME, "%H%M"))

NOTE: the above is a bit of a hack...strptime() assumes the current date if none is specified, but that shouldn't get in the way of this particular application, since all converted times will have the same date, the time part of the mean will be correct. I'll strip off the date later...

At that point, I think you can simply aggregate:

x2c <- aggregate(time2~week, data=godin, FUN=mean)

and get rid of the irrelevant (and erroneous) date part

x2c$time2 <- format(x2c$time2,"%H:%M:%S")

et Voila.

> x2c
      week    time2
1 2004-W29 09:40:00
2 2004-W30 01:45:00
3 2004-W31 13:45:00
4 2004-W36 12:07:00
5 2004-W37 10:32:30
6 2005-W31 12:27:30
7 2005-W36 10:48:20
8 2005-W37 13:11:06

The lesson here is that its tricky to push around times with no associated dates in R. I'd love to hear from others who have a better way of doing this.

回答2:

You need to convert the TIME column into common units (i.e. minutes). Here are a couple helper functions to do that:

hour2min <- function(hhmm) {
  hhmm <- as.numeric(hhmm)
  trunc(hhmm/100)*60 + hhmm %% 100
}
min2hour <- function(min) {
  min <- as.numeric(min)
  trunc(min/60)*100 + min %% 60
}

Then you can aggregate the minutes however you'd like. I would personally put them in an xts object and use apply.weekly:

library(xts)
x <- xts(hour2min(godin$TIME), as.Date(godin$date2), dimnames=list(NULL,"MINS"))
w <- apply.weekly(x, mean)
w$TIME <- min2hour(w$MINS)
#                MINS     TIME
# 2004-07-18 580.0000  940.000
# 2004-08-01 585.0000  945.000
# 2004-09-12 711.2500 1151.250
# 2005-08-02 747.5000 1227.500
# 2005-09-11 746.6667 1226.667
# 2005-09-13 764.1667 1244.167

Seconds are still in 100ths of a minute, but could be changed...

回答3:

If you want the average of all observation times per week and you want dates to matter (i.e. 19:00 today and 19:00 tomorrow would average to 07:00 tomorrow morning), then you can do this

godin$datetime <- as.POSIXct(paste(godin$date2, godin$TIME), format="%Y-%m-%d %H%M")
aggregate(godin$datetime, list(format(godin$datetime, "%W/%g")), mean)

#  Group.1                   x
#1   28/04 2004-07-18 09:40:00
#2   30/04 2004-07-31 01:45:00
#3   31/05 2005-08-02 00:27:30
#4   36/04 2004-09-10 13:51:15
#5   36/05 2005-09-11 00:26:40
#6   37/05 2005-09-13 00:44:10

However, I think you want the average time, and you don't care about the date (other than for determining which week it is). In that case, you can use an arbitrary Date as your anchor, and treat all times like they occur on that Date.

godin$stime <- as.POSIXct(paste("1970-01-01", godin$TIME), format='%Y-%m-%d %H%M')
aggregate(godin$stime, list(format(godin$datetime, "%W/%g")), mean)

#  Group.1                   x
#1   28/04 1970-01-01 09:40:00
#2   30/04 1970-01-01 09:45:00
#3   31/05 1970-01-01 12:27:30
#4   36/04 1970-01-01 11:51:15
#5   36/05 1970-01-01 12:26:40
#6   37/05 1970-01-01 12:44:10

Edit

The difference between the above output and the output that @JoshuaUlrich provided is that his is sorted by week. If I sort them, it's like Joshua's, but with seconds instead of 100ths of minutes. (Although, it is a little tricky to sort them the same way because of the format specification that you chose.)

out <- aggregate(godin$stime, list(format(godin$datetime, "%W/%g")), mean)
out[order(as.numeric(paste0(substr(out[, 1], 4, 5), substr(out[, 1], 1, 2)))), ]
#  Group.1                   x
#1   28/04 1970-01-01 09:40:00
#2   30/04 1970-01-01 09:45:00
#4   36/04 1970-01-01 11:51:15
#3   31/05 1970-01-01 12:27:30
#5   36/05 1970-01-01 12:26:40
#6   37/05 1970-01-01 12:44:10

Edit 2

You can format inside aggregate if you just want to get back a string formatted like %H%M

out <- aggregate(godin$stime, list(format(godin$datetime, "%W/%g")), function(TIME) format(mean(TIME), "%H%M"))
out[order(as.numeric(paste0(substr(out[, 1], 4, 5), substr(out[, 1], 1, 2)))), ]
#  Group.1    x
#1   28/04 0940
#2   30/04 0945
#4   36/04 1151
#3   31/05 1227
#5   36/05 1226
#6   37/05 1244

来源：https://stackoverflow.com/questions/11524895/aggregating-mean-hm-in-week-bins-in-r

标签

aggregate-functions

aggregate

time-series