I've got a data frame dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this data frame, e.g.,
rows <- list(c("34", "36", ...), ...)
My original post started with this erroneous statement:
The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which one is named "36", then starting from the beginning to do it again for "34".
Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.
Note that the suggestions in this answer assume that you have non-overlapping subsets of data.
If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of the string names.
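For example, a minimal sketch of that conversion, assuming rows holds character row names as in your question:

## Translate the character row names into integer positions once ...
rows.idx <- lapply(rows, function(r) match(r, rownames(dat)))

## ... then subset by position, so no repeated name lookups are needed
dat[rows.idx[[1]], ]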
An alternative is to store your "group" information as another column of your data.frame, then split your data.frame on its group, e.g. let's say your recoded data.frame looks like this:
dat <- data.frame(a=sample(100, 10),
                  b=rnorm(10),
                  group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
You could then do:
split(dat, dat$group)
$a
   a           b group
2 66 -0.08721261     a
9 62 -1.34114792     a

$b
    a          b group
1  32  0.9719442     b
5  79 -1.0204179     b
6  83 -1.7645829     b
7  73  0.4261097     b
10 44 -0.1160913     b

$c
   a          b group
3 77  0.2313654     c
4 74 -0.8637770     c
8 29  1.0046095     c
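If all you need is to run something over each of those pieces, you can lapply over the result of split; for instance, a sketch that takes the per-group means of the two numeric columns from the toy data above:

## Apply a summary function to every group in the list that split() returns
lapply(split(dat, dat$group), function(d) colMeans(d[, c("a", "b")]))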
Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:
library(data.table)
dat <- data.table(dat, key="group")
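With the key set, a single group can be pulled out via a keyed join, e.g. for the group "a" from the toy data:

## Keyed lookup: returns all rows whose key column equals "a"
dat[J("a")]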
Now do your list thing -- which will give you the same result as the split above:
x <- lapply(unique(dat$group), function(g) dat[J(g),])
But you probably want to "work over your splits", and you can do that inline, e.g.:
ans <- dat[, {
    ## do some code over the data in each split
    ## and return a list of results, e.g.:
    list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]
ans
     group nrow mean.a     mean.b
[1,]     a    2   64.0 -0.7141803
[2,]     b    5   62.2 -0.3006076
[3,]     c    3   60.0  0.1240660
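As an aside, on current versions of data.table the same grouped summary is usually written with the built-in row counter .N and the .() shorthand for list(), e.g.:

## .N is the number of rows in each group; .() is an alias for list()
dat[, .(nrow=.N, mean.a=mean(a), mean.b=mean(b)), by="group"]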
You can do the last step in "a similar fashion" with plyr, e.g.:
library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
      mean.b=mean(b))
  group nrow mean.a     mean.b
1     a    2   64.0 -0.7141803
2     b    5   62.2 -0.3006076
3     c    3   60.0  0.1240660
But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.
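If you want to check that on data of roughly your size before committing, here is a rough timing sketch (the 30000 x 50 shape and the three groups are placeholders meant to mimic your data, and big/big.dt are just made-up names):

## Build a data.frame of roughly the dimensions you describe
big <- as.data.frame(matrix(rnorm(30000 * 50), nrow=30000))
big$group <- sample(c('a', 'b', 'c'), 30000, replace=TRUE)
big.dt <- data.table(big, key="group")

## Time a grouped mean of the first column with plyr vs. data.table
system.time(ddply(big, "group", summarize, mean.V1=mean(V1)))
system.time(big.dt[, list(mean.V1=mean(V1)), by="group"])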