fast subsetting in R

后端未结

关注

 5  641

情书的邮戳 2021-02-03 14:00

I\'ve got a dataframe dat of size 30000 x 50. I also have a separate list that contains points to groupings of rows from this dataframe, e.g.,

rows <- list(c(


      
      
        
          5条回答        

        
                    
            
            
                         
                
              
              
                
                   -上瘾入骨i
                                             
                
                
                (楼主)
            
              
              
                2021-02-03 14:25
              

            
            
                        
Here's one attempt at a speedup - it hinges on the fact that it is faster to look up a row index than to look up a row name, and so tries to make a mapping of rowname to rownumber in dat.

First create some data of the same size as yours and assign some numeric rownames:

> dat <- data.frame(matrix(runif(30000*50),ncol=50))
> rownames(dat) <- as.character(sample.int(nrow(dat)))
> rownames(dat)[1:5]
[1] "21889" "3050"  "22570" "28140" "9576" 


Now generate a random rows with 15000 elements, each of 50 random numbers from 1 to 30000 (being row*names* in this case):

# 15000 groups of up to 50 rows each
> rows <- sapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))


For timing purposes, try the method in your question (ouch!):

# method 1
> system.time((res1 <- lapply(rows,function(r) dat[r,])))
   user  system elapsed 
182.306   0.877 188.362 


Now, try to make a mapping from row name to row number. map[i] should give the row number with name i.

FIRST if your row names are a permutation of 1:nrow(dat) you're in luck! All you have to do is sort the rownames, and return the indices:

> map <- sort(as.numeric(rownames(dat)), index.return=T)$ix
# NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.


Now look up row indices instead of row names:

> system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
   user  system elapsed
 32.424   0.060  33.050


Check we didn't screw anything up (note it is sufficient to match the rownames since rownames are unique in R):

> all(rownames(res1)==rownames(res2))
[1] TRUE


So, a ~6x speedup. Still not amazing though...

SECOND If you're unlucky and your rownames are not at all related to nrow(dat), you could try this, but only if max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It basically makes map with map[rowname] giving the row number, but since the rownames are not necessarily continuous any more there can be heaps of gaps in map which wastes a bit of memory:

map <- rep(-1,max(as.numeric(rownames(dat))))
obj <- sort(as.numeric(rownames(dat)), index.return=T)
map[obj$x] <- obj$ix


Then use map as before (dat[map[as.numeric(r),]]).
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它5个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复