Subsetting a data.table by range making use of binary search

Asked by 佛祖请我去吃肉, 2020-12-31 04:37

How do you go about subsetting a data.table by a numeric range, with the intention of using binary search?

For example:

require(data.table)
set.seed(1)
DT <- data.table(x = runif(1e7, 0, 10), y = runif(1e7, 0, 10))

2 Answers
  •  北荒, answered 2020-12-31 04:58

    Interesting question. First let's look at the example data :

    > print(DT)
                         x        y
           1: 2.607703e-07 5.748127
           2: 8.894131e-07 5.233994
           3: 1.098961e-06 9.834267
           4: 1.548324e-06 2.016585
           5: 1.569279e-06 7.957730
          ---                      
     9999996: 9.999996e+00 9.977782
     9999997: 9.999998e+00 2.666575
     9999998: 9.999999e+00 6.869967
     9999999: 9.999999e+00 1.953145
    10000000: 1.000000e+01 4.001616
    > length(DT$x)
    [1] 10000000
    > length(unique(DT$x))
    [1] 9988478
    > length(DT$y)
    [1] 10000000
    > length(unique(DT$y))
    [1] 9988225
    > DT[,.N,by=x][,table(N)]
    N
          1       2       3 
    9976965   11504       9 
    > DT[,.N,by="x,y"][,table(N)]
    N
           1 
    10000000 
    > 
    

    So there are almost 10 million unique floating point values in the first column: a few groups of 2 and 3 rows, but mostly 1-row groups. Once the second column is included, there are 10 million unique groups of size 1 row. This is quite a tough problem, since data.table is designed more with grouped data in mind; e.g. (id, date), (id1, id2, date, time), etc.

    However, data.table and setkey do support floating point data in keys, so let's give it a go.

    On my slow netbook :

    > system.time(setkey(DT,x,y))
       user  system elapsed 
      7.097   0.520   7.650 
    
    > system.time(DT[x>5 & y<7])
       user  system elapsed 
      2.820   0.292   3.122 
    

    So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given that the data is floating point and almost unique, this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

    Compare to base R, for example, sorting just x (not even y as well) :

    > system.time(base::order(DT$x))
       user  system elapsed 
     72.445   0.292  73.072 
    

    Assuming this data is representative of your real data, and you don't want to do this just once but several times, so are willing to pay the price of setkey, the first step is pretty clear :

    > system.time(w <- DT[.(5),which=TRUE,roll=TRUE])
       user  system elapsed 
      0.004   0.000   0.003 
    > w
    [1] 4999902
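(For readers new to these arguments: `which=TRUE` returns row positions rather than rows, and `roll=TRUE` rolls a lookup value with no exact match back to the nearest preceding key value. A small sketch on made-up data, not the 10-million-row table above:)

```r
library(data.table)

# Tiny keyed table; values chosen purely for illustration.
small <- data.table(x = c(1, 3, 5, 7, 9), y = 1:5, key = "x")

# No row has x == 6, so roll = TRUE "rolls" the lookup back to the
# nearest preceding key value (x == 5), and which = TRUE returns that
# row's position instead of the row itself.
small[.(6), which = TRUE, roll = TRUE]
# [1] 3
```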
    

    But here we're stuck. A next step like DT[(w+1):nrow(DT)] is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we'd do something like DT[.(unique(x), 7), which=TRUE, roll=TRUE], but in this case the data is so unique and floating point that that approach is going to be slow.
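Spelled out, that two-step workaround might look like the sketch below: binary search handles the x condition, then a vector scan over only the tail handles y. The intermediate subset is exactly the copy being complained about, and it assumes some rows satisfy x > 5 (i.e. w < nrow(DT)):

```r
library(data.table)

# Assumes DT is keyed on (x, y) as above.
w <- DT[.(5), which = TRUE, roll = TRUE]   # position of the last row with x <= 5

# Binary search told us where x > 5 starts; vector-scan just that tail
# for y < 7. Note the DT[(w + 1L):nrow(DT)] subset materialises a copy,
# and the range breaks if w == nrow(DT).
ans <- DT[(w + 1L):nrow(DT)][y < 7]
```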

    Ideally, this task needs range joins (FR#203) implementing. The syntax in this example might be :

    DT[.( c(5,Inf), c(-Inf,7) )]
    

    or to make it easier, DT[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.

    The speedups in v1.9.2 needed to be done first before we could move on to things like that. If you try setkey on this data in v1.8.10 you'll find that v1.9.2 is significantly faster.
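(An after-the-fact note, hedged: non-equi joins did subsequently land, in data.table v1.9.8, via the on= argument, and they cover exactly this range case. A sketch of today's equivalent; the x.x / x.y prefixes pull DT's own column values, since the joined columns otherwise display the values from i:)

```r
library(data.table)

# Range subset via a non-equi join (data.table >= 1.9.8).
lims <- data.table(lo = 5, hi = 7)

# x.x and x.y refer to DT's own columns; without the prefix the join
# columns would show lims's values (5 and 7) instead.
DT[lims, on = .(x > lo, y < hi), .(x = x.x, y = x.y)]
```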

    See also :

    How to self join a data.table on a condition

    Remove a range in data.table
