Splitting irregular time series into regular monthly averages - R

爱⌒轻易说出口 提交于 2019-12-03 08:42:37

Here's a start using data.table :

billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

require(data.table)
DT = as.data.table(billdata)

First, change type of columns begin and end to dates. Unlike data.frame, this doesn't copy the entire dataset.

DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]

Then find the time span, find the prevailing bill for each day, and aggregate.

alldays = DT[,seq(min(begin),max(end),by="day")]

setkey(DT, acct, begin)

DT[CJ(unique(acct),alldays),
   mean(amount/days,na.rm=TRUE),
   by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

    acct   month        V1
 1: 2242 2009-10 391.34483
 2: 2242 2009-11 406.69448
 3: 2242 2009-12 601.43226
 4: 2242 2010-01 646.27465
 5: 2242 2010-02 653.32143
 6: 2243 2009-10 938.51724
 7: 2243 2009-11  97.36172
 8: 2243 2009-12 375.68065
 9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429

I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.

I say it's a hint because it's not quite correct. Notice row 10 is repeated because account 2243 doesn't stretch into 2010-02 unlike account 2242. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.

See if speed is acceptable on the above, and we can go from there.

It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3. I'm using v1.8.3.

"Internal" error message when combining join containing missing groups and group by is fixed, #2162. For example : X[Y,.N,by=NonJoinColumn] where Y contains some rows that don't match to X. This bug could also result in a seg fault.

Let me know and we can either work around, or upgrade to 1.8.3 from R-Forge.

Btw, nice example data. That made it quicker to answer.


Here's the full answer alluded to above. It's a bit tricky I have to admit, as it combines together several features of data.table. This should work in 1.8.2 as it happens, but I've only tested in 1.8.3.

DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
    mean(amount/days,na.rm=TRUE),
    by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

   acct   month        V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11  97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429

Here is one way to do it:

billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

#First, declare your columns "begin" and "end" as dates:
strptime(billdata$begin, format="%Y-%m-%d") -> billdata$begin
strptime(billdata$end, format="%Y-%m-%d") -> billdata$end

#Then create a column with the amount per day on the billing period:
billdata$avg_on_period<-billdata$amount/billdata$days

#Then split it into days:
temp <- data.frame(acct=c(),month=c(),day=c(), avg=c())
for(i in 1:nrow(billdata)){
    X <- billdata[i,]
    seq(X$begin,X$end,by="day") -> list_day
    rbind(temp, data.frame(acct=rep(X$acct,length(list_day)), 
        month=format(list_day, "%Y-%m"), day=format(list_day, "%d"), 
        avg=rep(X$avg_on_period, length(list_day)))) -> temp
    }

# And finally merge the different days of the months together:
output<-aggregate(temp$avg, by=list(temp$month,temp$acct), FUN=mean)

colnames(output) <- c("Month","Account","Average per day")

output
    Month Account Average per day
1 2009-10    2242       391.34483
2 2009-11    2242       406.69448
3 2009-12    2242       595.40000
4 2010-01    2242       645.51964
5 2010-02    2242       653.32143
6 2009-10    2243       938.51724
7 2009-11    2243        97.36172
8 2009-12    2243       364.06250
9 2010-01    2243       415.51429
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!