data.table in R - multiple filters using multiple keys - binary search


春和景丽 2020-12-05 15:30

I don't understand how I can filter based on multiple keys in data.table. Take the built-in mtcars dataset.

DT <- data.table(mt


      
      
        
2 answers

小蘑菇 (original poster) 2020-12-05 15:46
              

            
            
                        
This question has become the target of a duplicate question, and I felt that the existing answers could be improved to help novice data.table users.

1. What is the difference between DT[.()] and DT[CJ()]?

According to ?data.table, .() is an alias for list(), and a list supplied as parameter i is converted into a data.table internally. So, DT[.(1, c(3, 4), c(2, 4))] is equivalent to DT[data.table(1, c(3, 4), c(2, 4))], where

data.table(1, c(3, 4), c(2, 4))
#   V1 V2 V3
#1:  1  3  2
#2:  1  4  4


The resulting data.table has two rows, which is the length of the longest vector; the single value 1 is recycled.

This differs from a cross join, which creates all combinations of the supplied vectors.

CJ(1, c(3, 4), c(2, 4))
#   V1 V2 V3
#1:  1  3  2
#2:  1  3  4
#3:  1  4  2
#4:  1  4  4


Note that setDT(expand.grid()) would produce the same combinations, although in a different row order and without a key.
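As a small check of this note (a sketch, not part of the original answer): CJ() and expand.grid() yield the same combinations, but CJ() returns them sorted and keyed, whereas expand.grid() varies its first argument fastest and names its columns Var1, Var2, and so on. The variable names cj, eg, order_differs and same_rows are illustrative.

```r
library(data.table)

cj <- CJ(1, c(3, 4), c(2, 4))            # sorted and keyed on V1, V2, V3
eg <- expand.grid(1, c(3, 4), c(2, 4))   # plain data.frame, columns Var1..Var3
setDT(eg)                                # convert to data.table by reference
setnames(eg, names(cj))                  # align column names with CJ()

key(cj)   # "V1" "V2" "V3"
key(eg)   # NULL

# The row order differs at first ...
order_differs <- !identical(cj$V2, eg$V2)

# ... but after sorting by all columns the values agree.
setkeyv(eg, names(cj))
same_rows <- identical(cj$V2, eg$V2) && identical(cj$V3, eg$V3)
```

So the two are interchangeable as lookup tables for a join, but not if the row order of the result matters.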

This explains why the OP gets two different results:

DT[.(1, c(3, 4), c(2, 4))]
#   mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1:  NA  NA   NA  NA   NA    NA    NA NA  1    3    2
#2:  21   6  160 110  3.9 2.620 16.46  0  1    4    4
#3:  21   6  160 110  3.9 2.875 17.02  0  1    4    4

DT[CJ(1, c(3, 4), c(2, 4))]
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1:   NA  NA    NA  NA   NA    NA    NA NA  1    3    2
#2:   NA  NA    NA  NA   NA    NA    NA NA  1    3    4
#3: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#4: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#5: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#6: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4


Note that the parameter nomatch = 0 will remove the non-matching rows, i.e., the rows containing NA.
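A minimal sketch of the effect of nomatch = 0, assuming (as the outputs above suggest) that DT is mtcars keyed on am, gear, carb. The names with_na and matched are illustrative.

```r
library(data.table)

# Assumed setup: mtcars keyed on am, gear, carb
DT <- data.table(mtcars)
setkey(DT, am, gear, carb)

# Default: unmatched lookup rows are kept and filled with NA
with_na <- DT[.(1, c(3, 4), c(2, 4))]

# nomatch = 0: unmatched lookup rows are dropped
matched <- DT[.(1, c(3, 4), c(2, 4)), nomatch = 0]

nrow(with_na)   # 3, as in the output above
nrow(matched)   # 2
```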

2. Using %in%

Besides CJ() and am == 1 & (gear == 3 | gear == 4) & (carb == 2 | carb == 4), there is a third equivalent option using value matching:

DT[am == 1 & gear %in% c(3, 4) & carb %in% c(2, 4)]
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#2: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#3: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#4: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4


Note that the CJ() join as written above requires the data.table to be keyed, while the two other variants also work with unkeyed data.tables.
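One possible way around the key requirement, which the original answer does not cover, is the on argument of data.table joins. The sketch below assumes an installed data.table version that supports on; the names DT_plain, res_on and res_in are illustrative.

```r
library(data.table)

DT_plain <- data.table(mtcars)   # no key set

# Name the CJ() columns after the columns to join on,
# then pass those names via `on` instead of relying on a key.
res_on <- DT_plain[CJ(am = 1, gear = c(3, 4), carb = c(2, 4)),
                   on = c("am", "gear", "carb"), nomatch = 0]

# Value matching on the same unkeyed table, for comparison
res_in <- DT_plain[am == 1 & gear %in% c(3, 4) & carb %in% c(2, 4)]

nrow(res_on) == nrow(res_in)
```

This variant was not part of the benchmark below, so nothing can be said here about how its speed compares.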

3. Benchmarking

Data

In order to test execution speed of the 3 options we need a much larger data.table than just the 32 rows of mtcars. This is achieved by repeatedly doubling mtcars until 1 million rows (89 MB) are reached. Then this data.table is copied to get a keyed version of the same input data.

library(data.table)
# create unkeyed data.table
DT_unkey <- data.table(mtcars)
for (i in 1:15) {
  DT_unkey <- rbindlist(list(DT_unkey, DT_unkey))
  print(nrow(DT_unkey))
}

# create keyed data.table
DT_keyed <- copy(DT_unkey)
setkeyv(DT_keyed, c("am", "gear", "carb"))

# show data.tables
tables()
#     NAME          NROW NCOL MB COLS                                         KEY         
#[1,] DT_keyed 1,048,576   11 89 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb am,gear,carb
#[2,] DT_unkey 1,048,576   11 89 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb             
#Total: 178MB


Run

To get a fair comparison, the setkey() operations are included in the timing. Also, the data.tables are explicitly copied to exclude effects from data.table's update by reference.
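Why the explicit copy() matters can be seen in a small sketch (the names DT_a, DT_b and DT_c are illustrative): plain assignment does not copy a data.table, so keying one name also keys the other.

```r
library(data.table)

DT_a <- data.table(mtcars)
DT_b <- DT_a          # no copy: both names refer to the same object
DT_c <- copy(DT_a)    # independent copy

# setkeyv() sorts and keys by reference
setkeyv(DT_b, c("am", "gear", "carb"))

key(DT_a)   # "am" "gear" "carb": DT_a was keyed along with DT_b
key(DT_c)   # NULL: the copy is unaffected
```

Without copy(), the first benchmark iteration would key the source table and all later iterations would start from already sorted data.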

With

result <- microbenchmark::microbenchmark(
  # baseline: copying and keying only, no filtering
  setkey = {
    DT_keyed <- copy(DT_unkey)
    setkeyv(DT_keyed, c("am", "gear", "carb"))},
  # binary search via a cross join on the keyed copy
  cj_keyed = {
    DT_keyed <- copy(DT_unkey)
    setkeyv(DT_keyed, c("am", "gear", "carb"))
    DT_keyed[CJ(1, c(3, 4), c(2, 4)), nomatch = 0]},
  # vector scan with == and |
  or_keyed = {
    DT_keyed <- copy(DT_unkey)
    setkeyv(DT_keyed, c("am", "gear", "carb"))
    DT_keyed[am == 1 & (gear == 3 | gear == 4) & (carb == 2 | carb == 4)]},
  or_unkey = {
    DT_copy <- copy(DT_unkey)
    DT_copy[am == 1 & (gear == 3 | gear == 4) & (carb == 2 | carb == 4)]},
  # vector scan with %in%
  in_keyed = {
    DT_keyed <- copy(DT_unkey)
    setkeyv(DT_keyed, c("am", "gear", "carb"))
    DT_keyed[am %in% c(1) & gear %in% c(3, 4) & carb %in% c(2, 4)]},
  in_unkey = {
    DT_copy <- copy(DT_unkey)
    DT_copy[am %in% c(1) & gear %in% c(3, 4) & carb %in% c(2, 4)]},
  times = 10L)


we get

print(result)
#Unit: milliseconds
#     expr       min        lq     mean    median       uq      max neval
#   setkey 198.23972 198.80760 209.0392 203.47035 213.7455 245.8931    10
# cj_keyed 210.03574 212.46850 227.6808 216.00190 254.0678 259.5231    10
# or_keyed 244.47532 251.45227 296.7229 287.66158 291.3811 404.8678    10
# or_unkey  69.78046  75.61220 103.6113  89.32464 111.5240 231.6814    10
# in_keyed 269.82501 270.81692 302.3453 274.42716 321.2935 431.9619    10
# in_unkey  93.75537  95.86832 119.4371 100.19446 126.6605 251.4172    10

ggplot2::autoplot(result)




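Apparently, setkey() is a rather costly operation: the setkey baseline alone has a median of about 200 ms, which accounts for most of the run time of the keyed variants. Subtracting that baseline from the medians, the CJ() lookup itself costs roughly 13 ms, compared with roughly 90 ms for the unkeyed vector scan with ==. So, for a one-time task the vector scan operations might be faster than using binary search on a keyed table, while keying can pay off when the same table is queried repeatedly.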
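To look at the lookup cost in isolation, the key can be set once outside the timed expressions. The following is only a sketch under that assumption and was not run as part of the original benchmark, so no timings are given; n_cj and n_in are illustrative names, and the timing step is skipped if microbenchmark is not installed.

```r
library(data.table)

# Rebuild the benchmark data as in the Data section
DT_unkey <- data.table(mtcars)
for (i in 1:15) DT_unkey <- rbindlist(list(DT_unkey, DT_unkey))

# Pay the keying cost once, outside the timed expressions
DT_keyed <- copy(DT_unkey)
setkeyv(DT_keyed, c("am", "gear", "carb"))

# Both approaches should select the same number of rows
n_cj <- nrow(DT_keyed[CJ(1, c(3, 4), c(2, 4)), nomatch = 0])
n_in <- nrow(DT_unkey[am %in% 1 & gear %in% c(3, 4) & carb %in% c(2, 4)])

# Time only the lookups
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    cj_keyed = DT_keyed[CJ(1, c(3, 4), c(2, 4)), nomatch = 0],
    in_unkey = DT_unkey[am %in% 1 & gear %in% c(3, 4) & carb %in% c(2, 4)],
    times = 10L))
}
```

Results will depend on the data.table version and hardware, so they may differ from the figures reported above.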

The benchmark was run with R version 3.3.2 (x86_64, mingw32), data.table 1.10.4, microbenchmark 1.4-2.1.
    
             
                                                        
            
            
              
                