R: using ddply in a loop over data frame columns

Submitted by 旧巷老猫 on 2019-12-02 06:37:53

Question


I need to calculate several new columns from each column in a subset of a data frame's columns and add them to that data frame. The source columns all hold time series data (there is a common date column). For example, I need to calculate the change from the same month in the previous year for a dozen columns. I could specify and calculate them individually, but that becomes onerous with a large number of columns to transform, so I am trying to automate the process with a for loop.

I was doing OK until I tried to use ddply to create a column for the running total of the value for the year so far. What happens is that ddply is adding new rows during each iteration through the loop and including those new rows in the cumsum calculation. I have two questions.

Q. How can I get ddply to calculate the correct cumsum?
Q. How can I specify the name of the column during the ddply call, rather than using a dummy value and renaming afterward?

[Edit: I spoke too soon, the updated code below does NOT work at this point, just FYI]

require(lubridate)
require(plyr)
require(xts)

set.seed(12345)
# create dummy time series data
monthsback <- 24
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback),
                   myvalue1 = runif(monthsback, min = 600, max = 800),
                   myvalue2 = runif(monthsback, min = 200, max = 300))

mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))
newcolnames <- c('myvalue1','myvalue2')

for (i in seq_along(newcolnames)) {
    print(newcolnames[i])
    mydf$myxts <- xts(mydf[, newcolnames[i]], order.by = mydf$mydate)
    ## Calculate change over same month in previous year
    mylag <- 12
    mydf[, paste(newcolnames[i], "_yoy", sep = "", collapse = "")] <- as.numeric(diff(mydf$myxts, lag = mylag)/ lag(mydf$myxts, mylag))
    ## Calculate change over previous month
    mylag <- 1
    mydf[, paste(newcolnames[i], "_mom", sep = "", collapse = "")] <- as.numeric(diff(mydf$myxts, lag = mylag)/ lag(mydf$myxts, mylag))

    ## Calculate cumulative figure
    #mydf$newcol <- as.numeric(mydf$myxts)
    mydf$newcol <- 1
    mydf <- ddply(mydf, .(year), transform, newcol = cumsum(as.numeric(mydf$myxts)))
    colnames(mydf)[colnames(mydf)=="newcol"] <- paste(newcolnames[i], "_cuml", sep = "", collapse = "")

}

mydf

Answer 1:


In your loop, since myxts is not part of the data frame, it is not split up in the ddply statement along with everything else. Change it to:

mydf$myxts <- xts(mydf[, newcolnames[i]], order.by = mydf$mydate)
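The scoping point above can be illustrated with a minimal base R sketch. It uses split(), lapply() and rbind() as a rough stand-in for what ddply does, not plyr itself, and the toy data frame and its column names are assumptions made up for the illustration. The key detail is that the column is referred to by its bare name inside transform(), so it is looked up in each per-year piece rather than in the full data frame:

```r
# Toy data (hypothetical values): two years, three rows each.
toy <- data.frame(year  = c(2018, 2018, 2018, 2019, 2019, 2019),
                  value = c(1, 2, 3, 10, 20, 30))

# Split by year, transform each piece, then recombine.
# Inside transform(), the bare name `value` resolves to the piece's own
# column, so cumsum() restarts for every year.
pieces <- split(toy, toy$year)
pieces <- lapply(pieces, function(piece) transform(piece, cuml = cumsum(value)))
result <- do.call(rbind, pieces)
rownames(result) <- NULL

result$cuml
```

Writing toy$value inside the transform() call instead would reach back to the full six-row column, which is the kind of reference the question's code makes with mydf$myxts.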

I don't know of any way to use dynamically generated names with transform.
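One possible workaround, sketched here rather than taken from the answer, is to skip transform for this step and assign the column directly with [[<-, which accepts a computed name. The sketch below assumes base R's ave() is an acceptable substitute for ddply when computing the per-year running total, and it uses a fixed start date instead of the question's now()-based one so that the result is reproducible:

```r
set.seed(12345)
# Dummy data shaped like the question's: 24 months, two value columns.
# The fixed start date is an assumption made for reproducibility.
mydf <- data.frame(mydate   = seq(as.Date("2018-01-01"), by = "month", length.out = 24),
                   myvalue1 = runif(24, min = 600, max = 800),
                   myvalue2 = runif(24, min = 200, max = 300))
mydf$year <- as.numeric(format(mydf$mydate, "%Y"))

valuecols <- c("myvalue1", "myvalue2")
for (col in valuecols) {
    # ave() applies FUN within each year and returns a vector in the
    # original row order, so it can be stored under a computed name.
    mydf[[paste0(col, "_cuml")]] <- ave(mydf[[col]], mydf$year, FUN = cumsum)
}

head(mydf)
```

Because the rows are already ordered by date, the cumulative sum within each year is the year-to-date total, and no rename step is needed afterward.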



Source: https://stackoverflow.com/questions/10518925/r-using-ddply-in-a-loop-over-data-frame-columns
