Question
I have a large matrix of about 3000 rows by 3000 columns. For every row, I'd like to compute the mean of the values grouped by column name. Column names repeat and appear in random order, for example:
Tree Tree House House Tree Car Car House
The result (the per-row mean of each group) should have the following columns:
Tree House Car
The tricky part (at least for me) is that I do not know all the column names in advance, and they appear in random order.
Answer 1:
You could try
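# for each distinct column name, take the row means of the matching columns;
# drop=FALSE keeps a one-column group as a matrix so rowMeans() still works
res1 <- vapply(unique(colnames(m1)), function(x)
    rowMeans(m1[, colnames(m1) == x, drop=FALSE], na.rm=TRUE),
    numeric(nrow(m1)))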
Or
res2 <- sapply(unique(colnames(m1)), function(x)
    rowMeans(m1[, colnames(m1) == x, drop=FALSE], na.rm=TRUE))
identical(res1,res2)
#[1] TRUE
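To make the grouping step easier to follow, here is a minimal sketch of the same vapply idea on a tiny matrix. The matrix `m` and its values are made up for illustration and are not part of the original answer.

```r
# Hypothetical 2 x 5 matrix with repeated column names (filled column by column)
m <- matrix(c(1, 2,   3, 4,   5, 6,   7, 8,   9, 10),
            nrow = 2,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "House")))

# Distinct names in order of first appearance: "Tree" "House" "Car"
groups <- unique(colnames(m))

# For each name, select the matching columns and average them row by row.
# drop = FALSE keeps the single "Car" column as a matrix for rowMeans().
res <- vapply(groups, function(g) {
  rowMeans(m[, colnames(m) == g, drop = FALSE], na.rm = TRUE)
}, numeric(nrow(m)))

print(res)
#      Tree House Car
# [1,]    3     6   7
# [2,]    4     7   8
```

The result columns follow the order in which the names first appear, so no prior knowledge of the names is needed.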
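Another option is to reshape the matrix into long form and then aggregate:
library(data.table)
# melt() on a matrix is the reshape2 method (Var1 = row, Var2 = column name);
# recent data.table versions may require calling reshape2::melt() explicitly
res3 <- dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2, fun.aggregate=mean)[, Var1 := NULL]
identical(res1, as.matrix(res3))
#[1] TRUE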
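The long-form idea can also be sketched in base R, without data.table or reshape2. This is only an illustration of the reshape-then-aggregate approach, not the code used in the answer; the matrix `m` and the names `long` and `res_long` are made up for the example.

```r
# Hypothetical 2 x 5 matrix with repeated column names (filled column by column)
m <- matrix(c(1, 2,   3, 4,   5, 6,   7, 8,   9, 10),
            nrow = 2,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "House")))

# Long form: one record per cell, with its row index and its column's name
long <- data.frame(row   = as.vector(row(m)),
                   group = colnames(m)[as.vector(col(m))],
                   value = as.vector(m))

# Mean of the values for every (row, group) combination
res_long <- tapply(long$value, list(long$row, long$group), mean)

print(res_long)
#   Car House Tree
# 1   7     6    3
# 2   8     7    4
```

Note that tapply orders the groups alphabetically, so the column order differs from the vapply result, which follows order of first appearance.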
Benchmarks
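For a 3000*3000 matrix, the first two methods are roughly 4.5 to 5.5 times faster than the reshape approach, and vapply is only marginally faster than sapply (about 3% at the median).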
set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE),
    ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
    3000, replace=TRUE)))
library(microbenchmark)
f1 <- function() {vapply(unique(colnames(m1)), function(x)
    rowMeans(m1[, colnames(m1) == x, drop=FALSE], na.rm=TRUE),
    numeric(nrow(m1)))}
f2 <- function() {sapply(unique(colnames(m1)), function(x)
    rowMeans(m1[, colnames(m1) == x, drop=FALSE], na.rm=TRUE))}
# reshape2::melt() is the matrix method that melt(m1) dispatches to
f3 <- function() {dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2,
    fun.aggregate=mean)[, Var1 := NULL]}
microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
# Unit: relative
#  expr      min       lq     mean   median       uq      max neval
#  f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
#  f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004    10
#  f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531    10
Data
set.seed(24)
m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10,
    dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))
Answer 2:
I came up with my own solution. I first transpose the matrix (called test_mean) so that the columns become rows, then:
# remove digits and dots from the row names
rownames(test_mean) <- gsub("[0-9.]", "", rownames(test_mean))
# aggregate by row name
test_mean <- aggregate(test_mean, by=list(rownames(test_mean)), FUN=mean)
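The transpose step is only described in words above, so here is a rough end-to-end sketch of that approach. It assumes the data starts as a data frame whose duplicated names were made unique by R (Tree, Tree.1, ...), which would explain why digits and dots are stripped; the data frame `df` and its values are hypothetical. As a small variation, the group labels are kept in a separate vector instead of being assigned as duplicated row names.

```r
# Hypothetical input: data.frame() renames duplicated columns to Tree.1, House.1
df <- data.frame(Tree = c(1, 2), House = c(3, 4), Tree = c(5, 6),
                 Car = c(7, 8), House = c(9, 10))

# Transpose so that the original columns become rows
tm <- t(as.matrix(df))

# Strip digits and dots to recover the group label of every row
groups <- gsub("[0-9.]", "", rownames(tm))
rownames(tm) <- NULL

# One row per group, one column (V1, V2, ...) per original row
agg <- aggregate(as.data.frame(tm), by = list(group = groups), FUN = mean)

# Transpose back: one column per group, one row per original row
res <- t(as.matrix(agg[, -1]))
colnames(res) <- as.character(agg$group)

print(res)
#    Car House Tree
# V1   7     6    3
# V2   8     7    4
```

One caveat: the pattern `[0-9.]` also removes digits and dots that are a genuine part of a column name, so names such as `Tree2` and `Tree` would be merged into one group.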
Source: https://stackoverflow.com/questions/26705553/aggregate-a-matrix-or-data-frame-by-column-name-groups-in-r