Summary data tables from wide data.frames

Question

I am looking for a lazy/easy way to create summary tables (data.frames) from wide data.frames. Assume the following data.frame, but with many more columns, so that typing out the column names would take a long time:

set.seed(2)
x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10,6), rep(20,6)), 
pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
Var1 = rnorm(12, 5,2), Var2 = c(rnorm(6,4,1), rnorm(6,3,5)),
Var3 = rt(12, 20))
x[1:3] <- lapply(x[1:3], factor)  # convert the three design columns to factors

Now I can calculate summary statistics with plyr:

library(plyr)
(mu <- ddply(x, .(Temp, pH), numcolwise(mean)))
(std <- ddply(x, .(Temp, pH), numcolwise(sd)))
(n  <- ddply(x, .(Temp, pH), numcolwise(length)))

But I cannot figure out how to apply all of these functions at once:

ddply(x, .(Temp, pH), numcolwise(mean, sd, length))

I could of course merge the separate summary data.frames, but that would not be a "lazy/easy" way of doing this. I am looking for something general that I could apply in many instances: something that produces the result below, except that it should take a single function call:

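xx <- merge(mu, std, by = c("Temp", "pH"), sort = FALSE)
# merge() marks the clashing columns with ".x" (from mu) and ".y" (from std)
colnames(xx) <- sub("\\.x$", ".mean", colnames(xx))
colnames(xx) <- sub("\\.y$", ".sd", colnames(xx))
xx <- merge(xx, n, by = c("Temp", "pH"), sort = FALSE)
# the last three columns are the counts that came from n
colnames(xx)[(ncol(xx)-2):ncol(xx)] <-
  paste0(colnames(xx)[(ncol(xx)-2):ncol(xx)], ".length")
# put the columns that belong to each variable next to each other
xx <- xx[c("Temp", "pH", grep("Var1", colnames(xx), value = TRUE),
           grep("Var2", colnames(xx), value = TRUE),
           grep("Var3", colnames(xx), value = TRUE))]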
xx

  Temp  pH Var1.mean  Var1.sd Var1.length Var2.mean  Var2.sd Var2.length Var3.mean  Var3.sd Var3.length
1   10 7.6  4.281195 1.352194           3  3.534447 1.652884           3 0.1529616 1.076276           3
2   10 8.1  5.583853 2.491672           3  4.116622 1.478286           3 1.1611944 1.081301           3
3   20 7.6  5.840411 1.120549           3  6.907273 8.628021           3 0.1301949 1.764201           3
4   20 8.1  6.635154 2.232262           3  8.893188 4.208087           3 0.5509202 1.187431           3

Is this possible to do in R currently? Any advice would be greatly appreciated.

Answer 1:

Base R's aggregate can actually handle this, but in a strange way:

(temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))
#   Temp  pH Rep.1 Rep.2 Rep.3   Var1.1   Var1.2   Var1.3   Var2.1   Var2.2   Var2.3
# 1   10 7.6     2     1     3 4.281195 1.352194 3.000000 3.534447 1.652884 3.000000
# 2   20 7.6     2     1     3 5.840411 1.120549 3.000000 6.907273 8.628021 3.000000
# 3   10 8.1     2     1     3 5.583853 2.491672 3.000000 4.116622 1.478286 3.000000
# 4   20 8.1     2     1     3 6.635154 2.232262 3.000000 8.893188 4.208087 3.000000
#      Var3.1    Var3.2    Var3.3
# 1 0.1529616 1.0762763 3.0000000
# 2 0.1301949 1.7642008 3.0000000
# 3 1.1611944 1.0813007 3.0000000
# 4 0.5509202 1.1874306 3.0000000
str(temp)
# 'data.frame':  4 obs. of  6 variables:
#  $ Temp: Factor w/ 2 levels "10","20": 1 2 1 2
#  $ pH  : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
#  $ Rep : num [1:4, 1:3] 2 2 2 2 1 1 1 1 3 3 ...
#  $ Var1: num [1:4, 1:3] 4.28 5.84 5.58 6.64 1.35 ...
#  $ Var2: num [1:4, 1:3] 3.53 6.91 4.12 8.89 1.65 ...
#  $ Var3: num [1:4, 1:3] 0.153 0.13 1.161 0.551 1.076 ...

Looking at the structure of the output, we find that "Rep", "Var1", and so on are actually matrices, with one column per function. You can extract them and cbind them, but that's somewhat tedious.
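As a minimal sketch of that manual extraction (the names var1 and out are only illustrative, and the example data are rebuilt here so the snippet runs on its own), one matrix column can be pulled out, labelled, and bound back to the grouping columns:

```r
# Rebuild the example data from the question
set.seed(2)
x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10, 6), rep(20, 6)),
                pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
                Var1 = rnorm(12, 5, 2), Var2 = c(rnorm(6, 4, 1), rnorm(6, 3, 5)),
                Var3 = rt(12, 20))
x[1:3] <- lapply(x[1:3], factor)

temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y)))

# Each summarised variable is one matrix column:
# rows are the groups, columns are the functions in the order they were applied
var1 <- temp$Var1
colnames(var1) <- c("mean", "sd", "length")

# Binding a named matrix to a data.frame prefixes its column names with "Var1."
out <- cbind(temp[c("Temp", "pH")], Var1 = var1)
names(out)
```

Repeating this for every variable is the tedious part that the rest of this answer automates.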

I had to do something similar a while back, and I ended up writing a basic wrapper around aggregate that looks like this:

aggregate2 <- function(data, aggs, ids, funs = NULL, ...) {
  # "." means "every column that is not named in the other argument"
  if (identical(aggs, "."))
    aggs <- setdiff(names(data), ids)
  if (identical(ids, "."))
    ids <- setdiff(names(data), aggs)
  if (is.null(funs))
    stop("Aggregation function missing")
  # Build a formula such as cbind(Var1, Var2) ~ Temp + pH
  myformula <- as.formula(
    paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")),
          " ~ ", paste(ids, collapse = " + ")))
  # Pass the formula by position; portable code should not name this argument
  temp <- aggregate(
    myformula, data = data,
    FUN = function(x) sapply(seq_along(funs), 
                             function(z) eval(call(funs[z], quote(x)))), ...)
  # Flatten the matrix columns and name them variable.function
  temp1 <- do.call(cbind, lapply(temp[-seq_along(ids)], as.data.frame))
  names(temp1) <- paste(rep(aggs, each = length(funs)), funs, sep = ".")
  cbind(temp[seq_along(ids)], temp1)
}
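To make the formula-building step easier to follow, here is a small sketch of what the wrapper constructs internally. The values of aggs and ids are just the example's column names, and f is an illustrative name:

```r
aggs <- c("Var1", "Var2", "Var3")   # columns to summarise
ids  <- c("Temp", "pH")             # grouping columns

# Same construction as inside aggregate2: paste the pieces together as text,
# then convert the text to a formula object
f <- as.formula(
  paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")),
        " ~ ", paste(ids, collapse = " + ")))
f
# cbind(Var1, Var2, Var3) ~ Temp + pH
```

The cbind() on the left-hand side is what lets aggregate summarise several columns in one call.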

Here's how you apply it to your example data.

(temp2 <- aggregate2(x, ".", c("Temp", "pH"), c("mean", "sd", "length")))
#   Temp  pH Rep.mean Rep.sd Rep.length Var1.mean  Var1.sd Var1.length Var2.mean
# 1   10 7.6        2      1          3  4.281195 1.352194           3  3.534447
# 2   20 7.6        2      1          3  5.840411 1.120549           3  6.907273
# 3   10 8.1        2      1          3  5.583853 2.491672           3  4.116622
# 4   20 8.1        2      1          3  6.635154 2.232262           3  8.893188
#    Var2.sd Var2.length Var3.mean  Var3.sd Var3.length
# 1 1.652884           3 0.1529616 1.076276           3
# 2 8.628021           3 0.1301949 1.764201           3
# 3 1.478286           3 1.1611944 1.081301           3
# 4 4.208087           3 0.5509202 1.187431           3

And the structure is what we expect.

str(temp2)
# 'data.frame':  4 obs. of  14 variables:
#  $ Temp       : Factor w/ 2 levels "10","20": 1 2 1 2
#  $ pH         : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
#  $ Rep.mean   : num  2 2 2 2
#  $ Rep.sd     : num  1 1 1 1
#  $ Rep.length : num  3 3 3 3
#  $ Var1.mean  : num  4.28 5.84 5.58 6.64
#  $ Var1.sd    : num  1.35 1.12 2.49 2.23
#  $ Var1.length: num  3 3 3 3
#  $ Var2.mean  : num  3.53 6.91 4.12 8.89
#  $ Var2.sd    : num  1.65 8.63 1.48 4.21
#  $ Var2.length: num  3 3 3 3
#  $ Var3.mean  : num  0.153 0.13 1.161 0.551
#  $ Var3.sd    : num  1.08 1.76 1.08 1.19
#  $ Var3.length: num  3 3 3 3

If you don't want to use the function, this is the part that specifically deals with working with the output of aggregate, as applied to the "temp" object we created at the start of this answer:

temp1 <- do.call(cbind, lapply(temp[-c(1:2)], as.data.frame))
funs <- c("mean", "sd", "length")
names(temp1) <- paste(rep(setdiff(names(temp), c("pH", "Temp")), 
                          each = length(funs)), funs, sep = ".")
cbind(temp[1:2], temp1)

Update: A simpler solution

It turns out that you can actually just do:

do.call(data.frame, 
        aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))

The downside here is that the names are less descriptive than those produced by the aggregate2 function I shared, but that can be fixed with a straightforward call to names.
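A minimal sketch of that renaming step, assuming the functions are applied in the order mean, sd, length (res, funs, and vars are illustrative names, and the example data are rebuilt so the snippet runs on its own):

```r
# Rebuild the example data from the question
set.seed(2)
x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10, 6), rep(20, 6)),
                pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
                Var1 = rnorm(12, 5, 2), Var2 = c(rnorm(6, 4, 1), rnorm(6, 3, 5)),
                Var3 = rt(12, 20))
x[1:3] <- lapply(x[1:3], factor)

res <- do.call(data.frame,
               aggregate(. ~ Temp + pH, x,
                         function(y) cbind(mean(y), sd(y), length(y))))

# The summary columns come back as Rep.1, Rep.2, Rep.3, Var1.1, ...
# Rebuild them as variable.function, keeping the order used in cbind() above
funs <- c("mean", "sd", "length")
vars <- setdiff(names(x), c("Temp", "pH"))
names(res)[-(1:2)] <- paste(rep(vars, each = length(funs)), funs, sep = ".")
names(res)
```

The order of funs has to match the order of the calls inside cbind(), otherwise the labels end up on the wrong columns.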

Answer 2:

One way to do it is with reshape2 and plyr, but you get the results with the variables in rows instead of columns:

library(reshape2)
library(plyr)
md <- melt(x[,-1], id.vars=c("Temp","pH"))
ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))

Which gives:

   Temp  pH variable      mean       sd
1    10 7.6     Var1 4.2811952 1.352194
2    10 7.6     Var2 3.5344474 1.652884
3    10 7.6     Var3 0.1529616 1.076276
4    10 8.1     Var1 5.5838533 2.491672
5    10 8.1     Var2 4.1166215 1.478286
6    10 8.1     Var3 1.1611944 1.081301
7    20 7.6     Var1 5.8404110 1.120549
8    20 7.6     Var2 6.9072734 8.628021
9    20 7.6     Var3 0.1301949 1.764201
10   20 8.1     Var1 6.6351538 2.232262
11   20 8.1     Var2 8.8931884 4.208087
12   20 8.1     Var3 0.5509202 1.187431

If you want your results in a wide format, you can use reshape:

md <- melt(x[,-1], id.vars=c("Temp","pH"))
result <- ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
reshape(result, idvar=c("Temp","pH"), timevar="variable",direction="wide")

   Temp  pH mean.Var1  sd.Var1 mean.Var2  sd.Var2 mean.Var3  sd.Var3
1    10 7.6  4.281195 1.352194  3.534447 1.652884 0.1529616 1.076276
4    10 8.1  5.583853 2.491672  4.116622 1.478286 1.1611944 1.081301
7    20 7.6  5.840411 1.120549  6.907273 8.628021 0.1301949 1.764201
10   20 8.1  6.635154 2.232262  8.893188 4.208087 0.5509202 1.187431

Source: https://stackoverflow.com/questions/14749237/summary-data-tables-from-wide-data-frames

Tags

dataframe

plyr

summary