Use of ddply + mutate with a custom function?

Posted by 半腔热情 on 2019-12-04 10:17:01

You're mostly right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.

With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.

mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.

mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))
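
To make the "take and return data frames" point concrete, here is a small sketch that inspects what the two calls above return. It assumes the plyr package is attached; the variable names mutated and summarized are only for illustration.

```r
library(plyr)

# Both functions take a data frame and return a data frame.
mutated    <- mutate(mtcars, mean.mpg = mean(mpg))
summarized <- summarize(mtcars, mean.mpg = mean(mpg))

is.data.frame(mutated)     # TRUE
is.data.frame(summarized)  # TRUE

# mutate keeps every row and column and appends the new column;
# summarize keeps only the new column, collapsed to one row.
dim(mutated)     # 32 12
dim(summarized)  # 1 1
```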

If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as argument, and return a data frame.

If you do use mutate or summarize, any other arguments you pass to ddply aren't used by ddply itself; they're just passed on to be used by mutate or summarize. And the expressions given to mutate and summarize act on the columns of the data, not on the entire data.frame. This is why the following works:

ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))

Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)

Thus mutate doesn't work with anonymous functions, or with bare functions at all. Pass it an expression! That expression can call a custom function that you define beforehand:

custom_function <- function(x) {mean(x + runif(length(x)))}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))

This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give those functions their arguments; you're not just passing the functions themselves.
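
As a sketch of the multiple-argument case, the helper below takes two columns. The name ratio_of_means and the choice of mpg and wt are assumptions made for illustration, not something from the question.

```r
library(plyr)

# Hypothetical helper: takes two vectors and returns a single number.
ratio_of_means <- function(x, y) {
    mean(x) / mean(y)
}

# The helper is called inside the expression, with columns as its arguments.
res <- ddply(mtcars, "cyl", summarize, mpg.per.wt = ratio_of_means(mpg, wt))
res
```

Note that the call is still an expression, ratio_of_means(mpg, wt), rather than the bare function ratio_of_means.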

You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbind-ed on:

mean.mpg.mutate = function(df) {
    cbind.data.frame(df, mean.mpg = mean(df$mpg))
}

mean.mpg.summarize = function(df) {
    data.frame(mean.mpg = mean(df$mpg))
}

ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)
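
If hard-coding the column feels too rigid, one possible variation is to give the function a column-name argument and let ddply forward it, since ddply passes its extra arguments on to .fun. This is only a sketch; col_mean_summarize and its col parameter are hypothetical names.

```r
library(plyr)

# Hypothetical helper: takes a (mini) data frame plus a column name,
# and returns a one-row data frame, as ddply requires.
col_mean_summarize <- function(df, col) {
    out <- data.frame(mean(df[[col]]))
    names(out) <- paste0("mean.", col)
    out
}

# Extra arguments to ddply are handed to the function for each piece.
by_mpg <- ddply(mtcars, "cyl", col_mean_summarize, col = "mpg")
by_hp  <- ddply(mtcars, "cyl", col_mean_summarize, col = "hp")
```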

tl;dr

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.

Mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.

If you don't use mutate/summarize, then your function needs to take and return a data frame.

If you do use mutate/summarize, then you don't pass them functions, you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
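
The difference between the two return shapes can be checked directly. This sketch assumes plyr is attached; the variable names are only for illustration.

```r
library(plyr)

# mutate: the group mean is recycled across every row of its group,
# so the result has as many rows as the input.
with_group_mean <- ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))
nrow(with_group_mean)  # 32

# summarize: one value per group, so one row per level of cyl.
group_means <- ddply(mtcars, "cyl", summarize, mean.mpg = mean(mpg))
nrow(group_means)      # 3
names(group_means)     # "cyl" "mean.mpg"
```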


What about dplyr?

This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process: instead of nesting mutate or summarize as arguments inside ddply, you call functions in sequence, group_by followed by mutate or summarize. The dplyr version of my answer would be

library(dplyr)
group_by(mtcars, cyl) %>%
    mutate(mean.mpg = mean(mpg))

With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.
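
For completeness, the summarize counterpart in dplyr might look like the sketch below. It assumes only dplyr is attached (attaching plyr afterwards would mask dplyr's mutate and summarize).

```r
library(dplyr)

# One row per group, like ddply(mtcars, "cyl", summarize, ...).
group_means <- group_by(mtcars, cyl) %>%
    summarize(mean.mpg = mean(mpg))

group_means
```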
