Aggregate a matrix (or data.frame) by column name groups in R

Question

I have a large matrix with about 3000 rows and 3000 columns. For every row, I'd like to aggregate the values (calculate the mean) grouped by column name. The column names repeat and appear in random order, for example:

 Tree Tree House House Tree Car Car House

The result (the per-row mean of each group) should have the following columns:

  Tree House Car

The tricky part (at least for me) is that I do not know all the column names in advance, and they appear in random order!

Answer 1:

You could try

res1 <- vapply(unique(colnames(m1)), function(x) 
      rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                             numeric(nrow(m1)) )

res2 <-  sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )

identical(res1,res2)
#[1] TRUE
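
To see what these calls return, here is a minimal sketch on a tiny hand-made matrix. The name m_small and its values are made up for illustration and are not taken from the question. Note that unique(colnames(...)) discovers the group names, so they do not need to be known in advance.

```r
# Hypothetical 2 x 4 example; values chosen so the means are easy to check by hand
m_small <- matrix(c(1, 2,  3, 4,  5, 6,  7, 8), nrow = 2,
                  dimnames = list(NULL, c("Tree", "House", "Tree", "Car")))

groups <- unique(colnames(m_small))   # "Tree" "House" "Car"

# One column of row means per group; drop = FALSE keeps a
# single-column group as a matrix so rowMeans() still works
res_small <- vapply(groups, function(g)
  rowMeans(m_small[, colnames(m_small) == g, drop = FALSE], na.rm = TRUE),
  numeric(nrow(m_small)))

res_small
#      Tree House Car
# [1,]    3     3   7
# [2,]    4     4   8
```

vapply additionally checks that every group yields a numeric vector of length nrow(m_small), which makes it the stricter of the two calls.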

Another option is to reshape the matrix into long form and then aggregate:

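 library(data.table)
 # melt() for a matrix is reshape2's method; call it explicitly, since
 # data.table's redirect to reshape2 for non-data.table input is deprecated
 res3 <- dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2, fun.aggregate=mean)[,Var1:= NULL]
 identical(res1, as.matrix(res3))
 #[1] TRUE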
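
The long-form idea can also be sketched in base R alone. This is an illustration rather than the code from the answer: m_small, long and wide are made-up names, and tapply stands in for the melt/dcast pair.

```r
# Hypothetical 2 x 4 example
m_small <- matrix(c(1, 2,  3, 4,  5, 6,  7, 8), nrow = 2,
                  dimnames = list(NULL, c("Tree", "House", "Tree", "Car")))

# Long form: one record per cell, carrying its row index and its column name
long <- data.frame(row   = as.vector(row(m_small)),
                   group = colnames(m_small)[as.vector(col(m_small))],
                   value = as.vector(m_small))

# Mean of the values for every (row, group) pair;
# the groups come out in sorted order (Car, House, Tree)
wide <- tapply(long$value, list(long$row, long$group), mean)
wide
```

Unlike the vapply result, the columns here follow the sorted group names rather than the order of first appearance.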

Benchmarks

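For a 3000 x 3000 matrix, the first two methods perform almost identically (sapply is about 3% slower than vapply by median), while the reshape approach takes roughly five times as long in the timings below.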

set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE), 
   ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
    3000,replace=TRUE)))

library(microbenchmark)

f1 <-function() {vapply(unique(colnames(m1)), function(x) 
     rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                           numeric(nrow(m1)) )}
f2 <- function() {sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )}

f3 <- function() {dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2,
            fun.aggregate=mean)[, Var1:= NULL]}

microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
#   Unit: relative
# expr      min       lq     mean   median       uq      max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004    10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531    10
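
A further base R variant, which is not part of the original answer and was not included in the timings above, sums each group with rowsum() and divides by the group size. The sketch below uses a made-up matrix m_small and assumes there are no missing values, so it is not a drop-in replacement for the rowMeans calls with na.rm=TRUE.

```r
# Hypothetical 2 x 4 example without missing values
m_small <- matrix(c(1, 2,  3, 4,  5, 6,  7, 8), nrow = 2,
                  dimnames = list(NULL, c("Tree", "House", "Tree", "Car")))
grp <- colnames(m_small)

# rowsum() groups rows, so work on the transpose:
# the result has one row per group (sorted) and one column per original row
sums <- rowsum(t(m_small), group = grp)

# Divide each group's sums by the number of columns in that group
counts <- table(grp)
res_rs <- t(sums / as.vector(counts[rownames(sums)]))
res_rs
```

Whether this is faster on the 3000 x 3000 case would need to be measured, for example by adding it to the microbenchmark call.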

Data

 set.seed(24)
 m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10, 
     dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))

Answer 2:

I came up with my own solution. I first transpose the matrix (called test_mean) so that the columns become rows, then:

# remove digits and dots from the row names
rownames(test_mean) <- gsub("[0-9.]", "", rownames(test_mean))

# aggregate by row names
test_mean <- aggregate(test_mean, by=list(rownames(test_mean)), FUN=mean)
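
The full round trip, including the transpose and the conversion back to a matrix with one column per group, might look like the sketch below. The matrix m_small is made up, and the suffixed names such as Tree.1 are an assumption about what the original data looked like, suggested by the gsub pattern.

```r
# Hypothetical 2 x 4 example
m_small <- matrix(c(1, 2,  3, 4,  5, 6,  7, 8), nrow = 2,
                  dimnames = list(NULL, c("Tree", "House", "Tree", "Car")))

# Assumed situation: names were made unique with numeric suffixes
suffixed <- make.unique(colnames(m_small))   # "Tree" "House" "Tree.1" "Car"
groups   <- gsub("[0-9.]", "", suffixed)     # "Tree" "House" "Tree"   "Car"

tm <- t(m_small)        # columns become rows
rownames(tm) <- NULL    # the grouping is passed separately through 'by'

# aggregate() returns a data.frame: the group column first,
# then one column per row of the original matrix
agg <- aggregate(tm, by = list(group = groups), FUN = mean)

# Transpose back so that the groups are columns again
res_t <- t(as.matrix(agg[, -1, drop = FALSE]))
colnames(res_t) <- as.character(agg$group)
res_t
```

The pattern [0-9.] also removes digits and dots that are genuinely part of a name, so a column called Car2 would be merged into Car.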

Source: https://stackoverflow.com/questions/26705553/aggregate-a-matrix-or-data-frame-by-column-name-groups-in-r

Tags

aggregate

mean