Summary data tables from wide data.frames

孤街浪徒 提交于 2019-12-01 01:12:57
A5C1D2H2I1M1N2O1R2T1

Base R's aggregate can actually handle this, but in a strange way:

(temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))
#   Temp  pH Rep.1 Rep.2 Rep.3   Var1.1   Var1.2   Var1.3   Var2.1   Var2.2   Var2.3
# 1   10 7.6     2     1     3 4.281195 1.352194 3.000000 3.534447 1.652884 3.000000
# 2   20 7.6     2     1     3 5.840411 1.120549 3.000000 6.907273 8.628021 3.000000
# 3   10 8.1     2     1     3 5.583853 2.491672 3.000000 4.116622 1.478286 3.000000
# 4   20 8.1     2     1     3 6.635154 2.232262 3.000000 8.893188 4.208087 3.000000
#      Var3.1    Var3.2    Var3.3
# 1 0.1529616 1.0762763 3.0000000
# 2 0.1301949 1.7642008 3.0000000
# 3 1.1611944 1.0813007 3.0000000
# 4 0.5509202 1.1874306 3.0000000
str(temp)
# 'data.frame':  4 obs. of  6 variables:
#  $ Temp: Factor w/ 2 levels "10","20": 1 2 1 2
#  $ pH  : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
#  $ Rep : num [1:4, 1:3] 2 2 2 2 1 1 1 1 3 3 ...
#  $ Var1: num [1:4, 1:3] 4.28 5.84 5.58 6.64 1.35 ...
#  $ Var2: num [1:4, 1:3] 3.53 6.91 4.12 8.89 1.65 ...
#  $ Var3: num [1:4, 1:3] 0.153 0.13 1.161 0.551 1.076 ...

Notice that when we look at the structure of the output, we find that "Rep", "Var1", and so on are actually matrices. So, you can extract them and cbind them. But, that's somewhat tedious.

I had to do something similar once a while back, and I ended up just writing a basic wrapper around aggregate that looks like this.

aggregate2 <- function(data, aggs, ids, funs = NULL, ...) {
  if (identical(aggs, "."))
    aggs <- setdiff(names(data), ids)
  if (identical(ids, "."))
    ids <- setdiff(names(data), aggs)
  if (is.null(funs))
    stop("Aggregation function missing")
  myformula <- as.formula(
    paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")),
          " ~ ", paste(ids, collapse = " + ")))
  temp <- aggregate(
    formula = eval(myformula), data = data,
    FUN = function(x) sapply(seq_along(funs), 
                             function(z) eval(call(funs[z], quote(x)))), ...)
  temp1 <- do.call(cbind, lapply(temp[-c(1:length(ids))], as.data.frame))
  names(temp1) <- paste(rep(aggs, each = length(funs)), funs, sep = ".")
  cbind(temp[1:length(ids)], temp1)
}

Here's how you apply it to your example data.

(temp2 <- aggregate2(x, ".", c("Temp", "pH"), c("mean", "sd", "length")))
#   Temp  pH Rep.mean Rep.sd Rep.length Var1.mean  Var1.sd Var1.length Var2.mean
# 1   10 7.6        2      1          3  4.281195 1.352194           3  3.534447
# 2   20 7.6        2      1          3  5.840411 1.120549           3  6.907273
# 3   10 8.1        2      1          3  5.583853 2.491672           3  4.116622
# 4   20 8.1        2      1          3  6.635154 2.232262           3  8.893188
#    Var2.sd Var2.length Var3.mean  Var3.sd Var3.length
# 1 1.652884           3 0.1529616 1.076276           3
# 2 8.628021           3 0.1301949 1.764201           3
# 3 1.478286           3 1.1611944 1.081301           3
# 4 4.208087           3 0.5509202 1.187431           3

And, the structure is what we expect.

str(temp2)
# 'data.frame':  4 obs. of  14 variables:
#  $ Temp       : Factor w/ 2 levels "10","20": 1 2 1 2
#  $ pH         : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
#  $ Rep.mean   : num  2 2 2 2
#  $ Rep.sd     : num  1 1 1 1
#  $ Rep.length : num  3 3 3 3
#  $ Var1.mean  : num  4.28 5.84 5.58 6.64
#  $ Var1.sd    : num  1.35 1.12 2.49 2.23
#  $ Var1.length: num  3 3 3 3
#  $ Var2.mean  : num  3.53 6.91 4.12 8.89
#  $ Var2.sd    : num  1.65 8.63 1.48 4.21
#  $ Var2.length: num  3 3 3 3
#  $ Var3.mean  : num  0.153 0.13 1.161 0.551
#  $ Var3.sd    : num  1.08 1.76 1.08 1.19
#  $ Var3.length: num  3 3 3 3

If you don't want to use the function, this is the part that specifically deals with working with the output of aggregate, as applied to the "temp" object we created at the start of this answer:

temp1 <- do.call(cbind, lapply(temp[-c(1:2)], as.data.frame))
funs <- c("mean", "sd", "length")
names(temp1) <- paste(rep(setdiff(names(temp), c("pH", "Temp")), 
                          each = length(funs)), funs, sep = ".")
cbind(temp[1:2], temp1)

Update: A more simple solution

It turns out that you can actually just do:

do.call(data.frame, 
        aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))

The downside here is that the names are less descriptive than the aggregate2 function I shared, but that can be fixed with a pretty straightforward call to names.

One way to do it with reshape2 and plyr. But you get results with variables in rows instead of columns :

library(reshape2)
library(plyr)
md <- melt(x[,-1], id.vars=c("Temp","pH"))
ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))

Which gives :

   Temp  pH variable      mean       sd
1    10 7.6     Var1 4.2811952 1.352194
2    10 7.6     Var2 3.5344474 1.652884
3    10 7.6     Var3 0.1529616 1.076276
4    10 8.1     Var1 5.5838533 2.491672
5    10 8.1     Var2 4.1166215 1.478286
6    10 8.1     Var3 1.1611944 1.081301
7    20 7.6     Var1 5.8404110 1.120549
8    20 7.6     Var2 6.9072734 8.628021
9    20 7.6     Var3 0.1301949 1.764201
10   20 8.1     Var1 6.6351538 2.232262
11   20 8.1     Var2 8.8931884 4.208087
12   20 8.1     Var3 0.5509202 1.187431

If you want your results in a wide format, you can use reshape :

md <- melt(x[,-1], id.vars=c("Temp","pH"))
result <- ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
reshape(result, idvar=c("Temp","pH"), timevar="variable",direction="wide")

   Temp  pH mean.Var1  sd.Var1 mean.Var2  sd.Var2 mean.Var3  sd.Var3
1    10 7.6  4.281195 1.352194  3.534447 1.652884 0.1529616 1.076276
4    10 8.1  5.583853 2.491672  4.116622 1.478286 1.1611944 1.081301
7    20 7.6  5.840411 1.120549  6.907273 8.628021 0.1301949 1.764201
10   20 8.1  6.635154 2.232262  8.893188 4.208087 0.5509202 1.187431
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!