问题
I am trying to find lazy/easy ways of creating summary tables/data.frames
from wide data.frames
. Assume a following data.frame, but with many more columns so that specifying the column names takes a long time:
set.seed(2)
x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10,6), rep(20,6)),
pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
Var1 = rnorm(12, 5,2), Var2 = c(rnorm(6,4,1), rnorm(6,3,5)),
Var3 = rt(12, 20))
x[1:3] <- as.data.frame(apply(x[1:3], 2, function(x) as.factor(x)))
Now I can calculate summary statistics with plyr
:
(mu <- ddply(x, .(Temp, pH), numcolwise(mean)))
(std <- ddply(x, .(Temp, pH), numcolwise(sd)))
(n <- ddply(x, .(Temp, pH), numcolwise(length)))
But I am not able to figure out how to apply all of these functions at the same time:
ddply(x, .(Temp, pH), numcolwise(mean, sd, length))
I could of course merge various summary data.tables, but this would not be a "lazy / easy" way of doing this. I am looking for something general that I could apply in many instances. Something like this with the exception that it should be possible to generate with a single function:
xx <- merge(mu, std, by = c("Temp", "pH"), sotr = F)
colnames(xx) <- gsub("x", "mean", colnames(xx))
colnames(xx) <- gsub("y", "sd", colnames(xx))
xx <- merge(xx, n, by = c("Temp", "pH"), sotr = F)
colnames(xx)[(ncol(xx)-2):ncol(xx)] <-
paste0(colnames(xx)[(ncol(xx)-2):ncol(xx)], ".length")
xx <- xx[c("Temp", "pH", grep("Var1", colnames(xx), value = T),
grep("Var2", colnames(xx), value = T),
grep("Var3", colnames(xx), value = T))]
xx
Temp pH Var1.mean Var1.sd Var1.length Var2.mean Var2.sd Var2.length Var3.mean Var3.sd Var3.length
1 10 7.6 4.281195 1.352194 3 3.534447 1.652884 3 0.1529616 1.076276 3
2 10 8.1 5.583853 2.491672 3 4.116622 1.478286 3 1.1611944 1.081301 3
3 20 7.6 5.840411 1.120549 3 6.907273 8.628021 3 0.1301949 1.764201 3
4 20 8.1 6.635154 2.232262 3 8.893188 4.208087 3 0.5509202 1.187431 3
Is this possible to do in R currently? Any advice would be greatly appreciated.
回答1:
Base R's aggregate
can actually handle this, but in a strange way:
(temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))
# Temp pH Rep.1 Rep.2 Rep.3 Var1.1 Var1.2 Var1.3 Var2.1 Var2.2 Var2.3
# 1 10 7.6 2 1 3 4.281195 1.352194 3.000000 3.534447 1.652884 3.000000
# 2 20 7.6 2 1 3 5.840411 1.120549 3.000000 6.907273 8.628021 3.000000
# 3 10 8.1 2 1 3 5.583853 2.491672 3.000000 4.116622 1.478286 3.000000
# 4 20 8.1 2 1 3 6.635154 2.232262 3.000000 8.893188 4.208087 3.000000
# Var3.1 Var3.2 Var3.3
# 1 0.1529616 1.0762763 3.0000000
# 2 0.1301949 1.7642008 3.0000000
# 3 1.1611944 1.0813007 3.0000000
# 4 0.5509202 1.1874306 3.0000000
str(temp)
# 'data.frame': 4 obs. of 6 variables:
# $ Temp: Factor w/ 2 levels "10","20": 1 2 1 2
# $ pH : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
# $ Rep : num [1:4, 1:3] 2 2 2 2 1 1 1 1 3 3 ...
# $ Var1: num [1:4, 1:3] 4.28 5.84 5.58 6.64 1.35 ...
# $ Var2: num [1:4, 1:3] 3.53 6.91 4.12 8.89 1.65 ...
# $ Var3: num [1:4, 1:3] 0.153 0.13 1.161 0.551 1.076 ...
Notice that when we look at the structure of the output, we find that "Rep", "Var1", and so on are actually matrices. So, you can extract them and cbind
them. But, that's somewhat tedious.
I had to do something similar once a while back, and I ended up just writing a basic wrapper around aggregate
that looks like this.
aggregate2 <- function(data, aggs, ids, funs = NULL, ...) {
if (identical(aggs, "."))
aggs <- setdiff(names(data), ids)
if (identical(ids, "."))
ids <- setdiff(names(data), aggs)
if (is.null(funs))
stop("Aggregation function missing")
myformula <- as.formula(
paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")),
" ~ ", paste(ids, collapse = " + ")))
temp <- aggregate(
formula = eval(myformula), data = data,
FUN = function(x) sapply(seq_along(funs),
function(z) eval(call(funs[z], quote(x)))), ...)
temp1 <- do.call(cbind, lapply(temp[-c(1:length(ids))], as.data.frame))
names(temp1) <- paste(rep(aggs, each = length(funs)), funs, sep = ".")
cbind(temp[1:length(ids)], temp1)
}
Here's how you apply it to your example data.
(temp2 <- aggregate2(x, ".", c("Temp", "pH"), c("mean", "sd", "length")))
# Temp pH Rep.mean Rep.sd Rep.length Var1.mean Var1.sd Var1.length Var2.mean
# 1 10 7.6 2 1 3 4.281195 1.352194 3 3.534447
# 2 20 7.6 2 1 3 5.840411 1.120549 3 6.907273
# 3 10 8.1 2 1 3 5.583853 2.491672 3 4.116622
# 4 20 8.1 2 1 3 6.635154 2.232262 3 8.893188
# Var2.sd Var2.length Var3.mean Var3.sd Var3.length
# 1 1.652884 3 0.1529616 1.076276 3
# 2 8.628021 3 0.1301949 1.764201 3
# 3 1.478286 3 1.1611944 1.081301 3
# 4 4.208087 3 0.5509202 1.187431 3
And, the structure is what we expect.
str(temp2)
# 'data.frame': 4 obs. of 14 variables:
# $ Temp : Factor w/ 2 levels "10","20": 1 2 1 2
# $ pH : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
# $ Rep.mean : num 2 2 2 2
# $ Rep.sd : num 1 1 1 1
# $ Rep.length : num 3 3 3 3
# $ Var1.mean : num 4.28 5.84 5.58 6.64
# $ Var1.sd : num 1.35 1.12 2.49 2.23
# $ Var1.length: num 3 3 3 3
# $ Var2.mean : num 3.53 6.91 4.12 8.89
# $ Var2.sd : num 1.65 8.63 1.48 4.21
# $ Var2.length: num 3 3 3 3
# $ Var3.mean : num 0.153 0.13 1.161 0.551
# $ Var3.sd : num 1.08 1.76 1.08 1.19
# $ Var3.length: num 3 3 3 3
If you don't want to use the function, this is the part that specifically deals with working with the output of aggregate
, as applied to the "temp" object we created at the start of this answer:
temp1 <- do.call(cbind, lapply(temp[-c(1:2)], as.data.frame))
funs <- c("mean", "sd", "length")
names(temp1) <- paste(rep(setdiff(names(temp), c("pH", "Temp")),
each = length(funs)), funs, sep = ".")
cbind(temp[1:2], temp1)
Update: A more simple solution
It turns out that you can actually just do:
do.call(data.frame,
aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))
The downside here is that the names are less descriptive than the aggregate2
function I shared, but that can be fixed with a pretty straightforward call to names
.
回答2:
One way to do it with reshape2
and plyr
. But you get results with variables in rows instead of columns :
library(reshape2)
library(plyr)
md <- melt(x[,-1], id.vars=c("Temp","pH"))
ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
Which gives :
Temp pH variable mean sd
1 10 7.6 Var1 4.2811952 1.352194
2 10 7.6 Var2 3.5344474 1.652884
3 10 7.6 Var3 0.1529616 1.076276
4 10 8.1 Var1 5.5838533 2.491672
5 10 8.1 Var2 4.1166215 1.478286
6 10 8.1 Var3 1.1611944 1.081301
7 20 7.6 Var1 5.8404110 1.120549
8 20 7.6 Var2 6.9072734 8.628021
9 20 7.6 Var3 0.1301949 1.764201
10 20 8.1 Var1 6.6351538 2.232262
11 20 8.1 Var2 8.8931884 4.208087
12 20 8.1 Var3 0.5509202 1.187431
If you want your results in a wide format, you can use reshape
:
md <- melt(x[,-1], id.vars=c("Temp","pH"))
result <- ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
reshape(result, idvar=c("Temp","pH"), timevar="variable",direction="wide")
Temp pH mean.Var1 sd.Var1 mean.Var2 sd.Var2 mean.Var3 sd.Var3
1 10 7.6 4.281195 1.352194 3.534447 1.652884 0.1529616 1.076276
4 10 8.1 5.583853 2.491672 4.116622 1.478286 1.1611944 1.081301
7 20 7.6 5.840411 1.120549 6.907273 8.628021 0.1301949 1.764201
10 20 8.1 6.635154 2.232262 8.893188 4.208087 0.5509202 1.187431
来源:https://stackoverflow.com/questions/14749237/summary-data-tables-from-wide-data-frames