Efficient apply or mapply for multiple matrix arguments by row

前端未结

I have two matrices that I want to apply a function to, by rows:

matrixA
           GSM83009  GSM83037  GSM83002  GSM83029  GSM83041
100001_at  5.873321  5.4


                      
2 Answers

误落风尘, 2020-12-29 08:29

Splitting the matrices isn't the biggest contributor to evaluation time.

set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)

system.time( scores <- mapply(t.test.stat,
    split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
#    user  system elapsed 
#    1.57    0.00    1.58 
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
#    user  system elapsed 
#    1.14    0.00    1.14 
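
The definition of t.test.stat comes from the question and is not shown in this excerpt. A minimal sketch, assuming it is the unpooled (Welch-style) two-sample statistic that the vectorized formula further down implies, might look like this. The function body is an assumption, not the asker's confirmed code:

```r
# Assumed definition: difference of means divided by the unpooled standard error
t.test.stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

set.seed(21)
x <- rnorm(5)
y <- rnorm(4)

# t.test() defaults to var.equal = FALSE, so its statistic should agree
ours <- t.test.stat(x, y)
ref  <- unname(t.test(x, y)$statistic)
all.equal(ours, ref)
```

If the asker's function differs (for example, a pooled-variance statistic), the timings above still apply, but the vectorized formula would need adjusting.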


The output from Rprof shows that most of the time is, not surprisingly, spent evaluating t.test.stat (mean, var, etc.). In other words, the cost is dominated by the overhead of thousands of small function calls.

Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
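
For reading the profile, the by.self and by.total tables are usually the most useful parts. The sketch below is self-contained, so it uses an assumed stand-in for t.test.stat (its real definition is not shown here), writes the samples to a temporary file, and repeats the call a few times so that enough samples are collected. The names prof_file and prof are illustrative only:

```r
# Assumed stand-in for t.test.stat, whose definition is not shown here
t.test.stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))

prof_file <- tempfile()                  # write samples to a temporary file
Rprof(prof_file, interval = 0.01)        # sample the call stack every 10 ms
for (i in 1:10) scores <- mapply(t.test.stat, smA, smB)  # repeat to collect enough samples
Rprof(NULL)                              # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)    # time spent in each function itself
head(prof$by.total)   # time spent in each function including its callees
```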


You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.

Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):

system.time({
ncA <- NCOL(matrixA)
ncB <- NCOL(matrixB)
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
  sqrt( rowMeans((matrixA-rowMeans(matrixA))^2)*(ncA/(ncA-1))/ncA +
        rowMeans((matrixB-rowMeans(matrixB))^2)*(ncB/(ncB-1))/ncB )
})
#    user  system elapsed 
#      0       0       0 
head(ans)
# [1]  0.8272511 -1.0965269  0.9862844 -0.6026452 -0.2477661  1.1896181
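
One way to check that the vectorized expression computes the same quantity as a row-by-row test is to compare it with t.test() on a small input. This is only a sketch: it assumes the intended statistic is the unpooled (Welch) one, which is what t.test() returns by default, and the matrix names A and B are made up for the example:

```r
set.seed(21)
A <- matrix(rnorm(5 * 200), nrow = 200)
B <- matrix(rnorm(4 * 200), nrow = 200)

# Row-by-row reference using t.test() (var.equal = FALSE by default)
slow <- unname(mapply(function(x, y) unname(t.test(x, y)$statistic),
                      split(A, row(A)), split(B, row(B))))

# Vectorized version: var(x) / n equals sum((x - mean(x))^2) / (n * (n - 1))
nA <- NCOL(A)
nB <- NCOL(B)
fast <- (rowMeans(A) - rowMeans(B)) /
  sqrt(rowSums((A - rowMeans(A))^2) / (nA * (nA - 1)) +
       rowSums((B - rowMeans(B))^2) / (nB * (nB - 1)))

all.equal(slow, fast)
```

Subtracting rowMeans(A) from A works because R stores matrices column by column, so the vector of row means is recycled down each column.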


UPDATE

Here's a "cleaner" version using a rowVars function:

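rowVars <- function(x, na.rm=FALSE, dims=1L) {
  # n counts the values actually used per row, so the n/(n-1) correction stays right when na.rm=TRUE
  n <- if (na.rm) rowSums(!is.na(x), dims=dims) else NCOL(x)
  rowMeans((x-rowMeans(x, na.rm, dims))^2, na.rm, dims)*(n/(n-1))
}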
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
  sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
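
The row-variance shortcut relies on the identity that the sample variance equals the mean squared deviation times n/(n-1). A small sketch, with a made-up matrix m and no missing values, comparing that shortcut against var() applied to each row:

```r
set.seed(21)
m <- matrix(rnorm(5 * 100), nrow = 100)
n <- NCOL(m)

# Shortcut: mean squared deviation from the row mean, rescaled by n / (n - 1)
fast <- rowMeans((m - rowMeans(m))^2) * (n / (n - 1))

# Reference: call var() once per row
slow <- apply(m, 1, var)

all.equal(fast, slow)
```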

                                                                        
                                                        
            
            
              
                
离开以前, 2020-12-29 08:36

This solution avoids both splitting and lists, so it may be faster than your version:

## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)

## combine them together
tmp3 <- cbind(tmp1, tmp2)

## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)], 
  x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)
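
The same idea can be wrapped so that the column ranges are derived from the inputs rather than typed by hand. This is a sketch only; row_t_stats is a hypothetical helper name, not something from the original answer:

```r
# Hypothetical helper: row-wise Welch t statistics for two matrices with equal row counts
row_t_stats <- function(a, b) {
  stopifnot(nrow(a) == nrow(b))
  idx_a <- seq_len(ncol(a))              # columns that came from a
  idx_b <- ncol(a) + seq_len(ncol(b))    # columns that came from b
  combined <- cbind(a, b)
  apply(combined, 1, function(x) unname(t.test(x[idx_a], x[idx_b])$statistic))
}

set.seed(1)
tmp1 <- matrix(rnorm(5 * 10), nrow = 10)
tmp2 <- matrix(rnorm(4 * 10), nrow = 10)
t.stats <- row_t_stats(tmp1, tmp2)

# Spot check: row 3 should match a direct call on that row
all.equal(t.stats[3], unname(t.test(tmp1[3, ], tmp2[3, ])$statistic))
```

Using seq_len() instead of 1:ncol(...) avoids the surprising 1:0 sequence if one of the matrices ever has zero columns.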


Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:

tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))


#    user  system elapsed 
#   5.640   0.012   5.705 
                                                                        
                                                        
            
            
              
                