Calculating hourly averages from a multi-year timeseries

问题

I have a dataset filled with the average windspeed per hour for multiple years. I would like to create an 'average year', in which for each hour the average windspeed for that hour over multiple years is calculated. How can I do this without looping endlessly through the dataset? Ideally, I would like to just loop through the data once, extracting for each row the right month, day, and hour, and adding the windspeed from that row to the right row in a dataframe where the aggregates for each month, day, and hour are gathered. Is it possible to do this without extracting the month, day, and hour, and then looping over the complete average-year data.frame to find the right row?

Some example data:

data.multipleyears <- data.frame(
 DATETIME = c("2001-01-01 01:00:00", "2001-05-03 09:00:00", "2007-01-01 01:00:00", "2008-02-29 12:00:00"),
 Windspeed = c(10, 5, 8, 3)
)

Which I would like to aggregate in a dataframe like this:

average.year <- data.frame(
 DATETIME = c("01-01 00:00:00", "01-01 01:00:00", ..., "12-31 23:00:00")
 Aggregate.Windspeed = (100, 80, ...)
)

From there, I can go on calculating the averages, etc. I have probably overlooked some command, but what would be the right syntax for something like this (in pseudocode):

 for(i in 1:nrow(data.multipleyears) {
  average.year$Aggregate.Windspeed[
   where average.year$DATETIME(month, day, hour) == data.multipleyears$DATETIME[i](month, day, hour)]  <- average.year$Aggregate.Windspeed + data.multipleyears$Windspeed[i]
 }

Or something like that. Help is appreciated!

回答1:

I predict that ddply and the plyr package are going to be your best friend :). I created a 30 year dataset with hourly random windspeeds between 1 and 10 ms:

begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
# 30 year dataset
dat = data.frame(dt = begin_date + (0:(24*30*365)) * (3600))
dat = within(dat, {
  speed = runif(length(dt), 1, 10)
  unique_day = strftime(dt, "%d-%m")
})
> head(dat)
                   dt unique_day    speed
1 1990-01-01 00:00:00      01-01 7.054124
2 1990-01-01 01:00:00      01-01 2.202591
3 1990-01-01 02:00:00      01-01 4.111633
4 1990-01-01 03:00:00      01-01 2.687808
5 1990-01-01 04:00:00      01-01 8.643168
6 1990-01-01 05:00:00      01-01 5.499421

To calculate the daily normalen (30 year average, this term is much used in meteorology) over this 30 year period:

library(plyr)
res = ddply(dat, .(unique_day), 
            summarise, mean_speed = mean(speed), .progress = "text")
> head(res)
  unique_day mean_speed
1      01-01   5.314061
2      01-02   5.677753
3      01-03   5.395054
4      01-04   5.236488
5      01-05   5.436896
6      01-06   5.544966

This takes just a few seconds on my humble two core AMD, so I suspect just going once through the data is not needed. Multiple of these ddply calls for different aggregations (month, season etc) can be done separately.

回答2:

You can use substr to extract the part of the date you want, and then use tapply or ddply to aggregate the data.

tapply(
  data.multipleyears$Windspeed, 
  substr( data.multipleyears$DATETIME, 6, 19), 
  mean 
)
# 01-01 01:00:00 02-29 12:00:00 05-03 09:00:00 
#              9              3              5 

library(plyr)
ddply(
  data.multipleyears, 
  .(when=substr(DATETIME, 6, 19)), 
  summarize, 
  Windspeed=mean(Windspeed)
)
#             when Windspeed
# 1 01-01 01:00:00         9
# 2 02-29 12:00:00         3
# 3 05-03 09:00:00         5

回答3:

It is pretty old post, but I wanted to add. I guess timeAverage in Openair can also be used. In the manual, there are more options for timeAverage function.

来源：https://stackoverflow.com/questions/10007877/calculating-hourly-averages-from-a-multi-year-timeseries

标签

dataframe

plyr