How can I use variable names to refer to data frame columns with ddply?

Posted by 痞子三分冷 on 2019-12-13 11:49:31

Question


I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr.

When I use the name of the column directly with ddply and cumsum I have no problems:

require(plyr)
df <- data.frame(date = seq(as.Date("2007/1/1"),
                     by = "month",
                     length.out = 60),
                 sales = runif(60, min = 700, max = 1200))

df$year <- as.numeric(format(as.Date(df$date), format="%Y"))
df <- ddply(df, .(year), transform,
            cum_sales = (cumsum(as.numeric(sales))))

This is all well and good but the ultimate aim is to be able to pass a column name to this function. When I try to use a variable in place of the column name, it doesn't work as I expected:

mycol <- "sales"
df[mycol]

df <- ddply(df, .(year), transform,
            cum_value2 = cumsum(as.numeric(df[mycol])))

I thought I knew how to access columns by name. This worries me because it suggests that I have failed to understand something basic about indexing and extraction. I would have thought that referring to columns by name in this way would be a common need.

I have two questions.

  1. What am I doing wrong i.e. what have I misunderstood?
  2. Is there a better way of going about this, bearing in mind that the names of the columns will not be known beforehand by the function?

TIA


Answer 1:


The arguments to ddply are expressions that are evaluated in the context of each piece the original data frame is split into. Your df[mycol] addresses the whole data frame, so you cannot pass it as-is (btw, why do you need all the as.numeric(as.character()) conversions? They do nothing for a column that is already numeric).
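
To make the difference concrete, here is a small base R sketch. It assumes a made-up data frame called toy, and it uses split() only as a stand-in for the splitting that ddply performs internally; it is meant as an illustration rather than a description of plyr's actual implementation.

```r
# Toy data (hypothetical): two years, three observations each
toy <- data.frame(year  = rep(c(2007, 2008), each = 3),
                  sales = c(10, 20, 30, 1, 2, 3))
mycol <- "sales"

# Single brackets return a one-column data frame, not a vector
whole_df_class <- class(toy[mycol])

# Double brackets return the column itself, but for ALL rows
whole_len <- length(toy[[mycol]])

# split() mimics the per-year pieces; each piece has only 3 rows
pieces <- split(toy, toy$year)
piece_rows <- sapply(pieces, nrow)

# Indexing the piece (not the full data frame) gives per-year running totals
per_year <- lapply(pieces, function(piece) cumsum(piece[[mycol]]))
```

The whole-frame column has six values while each piece has three rows, which is why the lookup has to happen on the piece.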

The easiest way is to write your own function that does everything inside and takes the column name as an argument, e.g.

df <- ddply(df, 
            .(year), 
            .fun = function(x, colname) transform(x, cum_sales = cumsum(x[,colname])), 
            colname = "sales")
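
Since the goal in the question is a function that receives the column name, the same idea could be wrapped up roughly as follows. This is a sketch, not a tested recipe for every case: add_running_total and its arguments value_col, group_col and new_col are hypothetical names chosen for illustration, and the grouping column is passed as a character vector, which ddply accepts for its .variables argument.

```r
library(plyr)

# Hypothetical helper: add a running total of `value_col` within each group
add_running_total <- function(data, value_col, group_col = "year",
                              new_col = paste0("cum_", value_col)) {
  ddply(data, group_col, function(piece) {
    # `piece` holds the rows of one group only, so index it, not `data`
    piece[[new_col]] <- cumsum(piece[[value_col]])
    piece
  })
}

# Toy data (hypothetical) to show the call
toy <- data.frame(year  = rep(c(2007, 2008), each = 3),
                  sales = c(10, 20, 30, 1, 2, 3))
result <- add_running_total(toy, "sales")
```

Because the anonymous function is defined inside the helper, it can see value_col and new_col directly, so nothing depends on variables living in the global environment.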



Answer 2:


The problem is that ddply expects its last arguments to be expressions that will be evaluated on chunks of the data.frame (one per year, in your example). If you use df[mycol], you get the whole data.frame, not the annual chunks.

The following works, but is not very elegant: I build the expression as a string, and then convert it with eval(parse(...)).

ddply( df, .(year), transform, 
  cum_value2 = eval(parse( text = 
    sprintf( "cumsum(as.numeric(as.character(%s)))", mycol )
  ))
)
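
If building code from strings feels too fragile, one possible alternative is base R's ave(), which applies a function within groups and returns a vector aligned with the original rows. Note that this swaps ddply for a different technique; the toy data frame below is made up for illustration.

```r
# Toy data (hypothetical): two years, three observations each
toy <- data.frame(year  = rep(c(2007, 2008), each = 3),
                  sales = c(10, 20, 30, 1, 2, 3))
mycol <- "sales"

# ave() runs cumsum separately for each year and keeps the row order,
# so the column name can stay an ordinary string
toy$cum_value2 <- ave(toy[[mycol]], toy$year, FUN = cumsum)
```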


Source: https://stackoverflow.com/questions/8869005/how-can-i-use-variable-names-to-refer-to-data-frame-columns-with-ddply
