Rolling sum on an unbalanced time series

Question

I have a series of annual incident counts per category, with no rows for years in which the category did not see an incident. I would like to add a column that shows, for each year, how many incidents occurred in the previous three years.

One way to handle this is to add empty rows for all years with zero incidents, then use rollapply() with a right-aligned four-year window, but that would expand my data set more than I want to. Surely there's a way to use ddply() and transform() for this?

The following two statements build a dummy data set, then compute a simple plyr sum by category:

library(plyr)

# Dummy data: three categories, each with the same six non-consecutive years
dat <- data.frame(
   category=c(rep('A',6), rep('B',6), rep('C',6)),
   year=rep(c(2000,2001,2004,2005,2009,2010),3),
   incidents=rpois(18, 3)
   )

# Add the per-category total to every row
ddply(dat, .(category), transform, i_per_c=sum(incidents))

That works, but it only shows a per-category total.

I want a total that's year-dependent.

So I tried to extend the ddply() call with the function() syntax, like so:

ddply(dat, .(category) , transform, 
      function(x) i_per_c=sum(ifelse(x$year >= year - 4 & x$year < year,  x$incidents, 0) )
      )

This just returns the original data frame, unmodified.

I must be missing something in the plyr syntax, but I don't know what it is.

Thanks, Matt

Answer 1:

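This is sorta ugly, but it works. Your attempt returns the data unchanged because the function reaches transform() as an unnamed argument, so no column is created from it. Nested ply calls do the job: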

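# Window as written: the current year plus the year before it.
# For the three years before the current one, use
# year >= (x$year - 3) & year < x$year instead.
ddply(dat, .(category), 
    function(datc) adply(datc, 1, 
         function(x) data.frame(run_incidents =
                                sum(subset(datc, year>(x$year-2) & year<=x$year)$incidents))))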

There might be a slightly cleaner way to do it, and there are definitely ways that execute much faster.
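
One possible alternative, as a minimal base R sketch rather than a definitive solution: compute the windowed sum with a small vectorised helper and fill the column one category at a time. The helper name prev_sum, the column name prev3, and the fixed incident counts are assumptions made for illustration. The window (the three years strictly before the current one) is one reading of the question.

```r
# Hypothetical helper (not part of plyr): for each element of `year`, sum the
# `incidents` recorded in the `width` years strictly before it.
prev_sum <- function(year, incidents, width = 3) {
  vapply(
    year,
    function(y) sum(incidents[year >= y - width & year < y]),
    numeric(1)
  )
}

# Same layout as the question's data, with fixed illustrative counts
# instead of rpois() so the result is reproducible.
dat <- data.frame(
  category  = rep(c("A", "B", "C"), each = 6),
  year      = rep(c(2000, 2001, 2004, 2005, 2009, 2010), 3),
  incidents = c(3, 1, 4, 1, 5, 9,  2, 6, 5, 3, 5, 8,  9, 7, 9, 3, 2, 3)
)

# Fill the new column one category at a time; no rows are added.
dat$prev3 <- NA_real_
for (k in unique(dat$category)) {
  rows <- dat$category == k
  dat$prev3[rows] <- prev_sum(dat$year[rows], dat$incidents[rows])
}
```

With these counts, category A gets prev3 values of 0, 3, 1, 4, 0, 5. Years with no earlier rows inside the window get 0. The helper still compares every year against every other year within a category, so it avoids the per-row data frame overhead but is not the fastest option for long series.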

Source: https://stackoverflow.com/questions/8947952/rolling-sum-on-an-unbalanced-time-series

Tags

time-series

plyr