Fastest way to filter a data.frame list column contents in R / Rcpp

Asked by 长发绾君心 on 2020-12-17 02:02

I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b",
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")

I want the fastest way to remove a given value (e.g. "a") from each element of vars and drop the rows whose vars is empty afterwards.

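For reference, df prints as follows (base R shows each list-column element comma-separated):

df
#   id    vars
# 1  1       a
# 2  2 a, b, c
# 3  3    b, c
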
3 Answers
  • 2020-12-17 02:47

    Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:

    library(data.table)
    dt = as.data.table(df)  # or use setDT to convert in place
    
    dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
    #   id  vars newcol
    #1:  2 a,b,c    b,c
    #2:  3   b,c    b,c
    

    You can also delete the original column (at essentially zero cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')]; a sketch of that variant follows.
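
    A minimal sketch combining that in-place overwrite with the filter (my own combination of the steps above, assuming a fresh dt = as.data.table(df)):

    dt[, vars := lapply(vars, setdiff, 'a')]  # strip 'a' from each element, by reference
    dt[lengths(vars) != 0]                    # keep only rows with something left
    #   id vars
    #1:  2  b,c
    #2:  3  b,c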


    Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
    #   id vars
    #1:  2  b,c
    #2:  3  b,c
    
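    To see why this works, here is the intermediate long format produced by the first step: each vars is unlisted to one value per row, which the !V1 %in% 'a' filter then operates on before re-aggregating by id:

    dt[, unlist(vars), by = id]
    #   id V1
    #1:  1  a
    #2:  2  a
    #3:  2  b
    #4:  2  c
    #5:  3  b
    #6:  3  c
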

    To carry along the other columns, I think it's easiest to simply merge back:

    dt[, othercol := 5:7]
    
    # notice the keyby
    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    #   id vars i.vars othercol
    #1:  2  b,c  a,b,c        6
    #2:  3  b,c    b,c        7
    
  • 2020-12-17 02:51

    Here's another way:

    # prep
    library(data.table)
    DT <- data.table(df)
    DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
    setkey(DT, vstr)
    
    # all "_"-joined sorted combinations of the values in x
    # (i.e. every non-empty subset of x, encoded as a key string)
    get_badkeys <- function(x) 
      unlist(sapply(seq_along(x), function(n) combn(sort(x), n, paste0, collapse = "_")))
    
    # choose values to exclude
    baduns  <- c("a","b")
    
    # subset
    DT[!J(get_badkeys(baduns))]
    

    A row's vstr matches one of these keys exactly when its vars is a subset of baduns, i.e. when nothing would survive removing them, so the anti-join !J(...) drops exactly those rows. (Unlike the other answers, this filters rows but does not strip the bad values from vars in the rows that remain.) This is fairly fast, but it takes up your key.
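
    For instance, with the baduns above, the excluded keys are:

    get_badkeys(c("a", "b"))
    # [1] "a"   "b"   "a_b"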


    Benchmarks. Here's a made-up example:

    Candidates:

    library(dplyr)  # hannahh() and stevenb() use dplyr
    
    hannahh <- function(df,baduns){
        df %>% 
            mutate(vars = lapply(.$vars, setdiff, baduns)) %>% 
            filter(lengths(vars) > 0)  # keep rows with a non-empty result
    }
    eddi    <- function(df,baduns){
            dt = as.data.table(df)
            dt[, 
              unlist(vars)
            , by = id][!V1 %in% baduns, 
              .(vars = list(V1))
            , keyby = id][dt, nomatch = 0]
    }   
    stevenb <- function(df,baduns){
        df %>% 
          rowwise() %>% 
          do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
          mutate(length = length(newcol)) %>%
          ungroup() %>%
          filter(length > 0)
    }
    frank   <- function(df,baduns){
        DT <- data.table(df)
        DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
        setkey(DT,vstr)
        DT[!J(get_badkeys(baduns))]
    }
    

    Simulation:

    nvals  <- 4
    nbads  <- 2
    maxlen <- 4
    
    nobs   <- 1e4
    
    # valset was not shown in the original post; an assumed definition,
    # consistent with the "valset of characters" discussed below:
    valset <- letters[1:nvals]
    
    exdf   <- data.table(
      id   = 1:nobs,
      vars = replicate(nobs, list(sample(valset, sample(maxlen, 1))))
    )
    setDF(exdf)
    baduns <- valset[1:nbads]
    
    

    Results:

    system.time(frank_res   <- frank(exdf,baduns))
    #   user  system elapsed 
    #   0.24    0.00    0.28 
    system.time(hannahh_res <- hannahh(exdf,baduns))
    #   0.42    0.00    0.42
    system.time(eddi_res    <- eddi(exdf,baduns))
    #   0.05    0.00    0.04
    system.time(stevenb_res <- stevenb(exdf,baduns))
    #   36.27   55.36   93.98
    

    Checks:

    identical(sort(frank_res$id),eddi_res$id) # TRUE
    identical(unlist(stevenb_res$id),eddi_res$id) # TRUE
    identical(unlist(hannahh_res$id),eddi_res$id) # TRUE
    

    Discussion:

    For eddi() and hannahh(), the results scarcely change with nvals, nbads and maxlen. In contrast, once baduns grows past about 20 values, frank() becomes incredibly slow (20+ sec), since it must build a key for every combination of them; it also scales a little worse with nbads and maxlen than the other two.

    Scaling up nobs, eddi()'s lead over hannahh() stays the same, at about 10x. Against frank(), it sometimes shrinks and sometimes stays the same. In the best nobs = 1e5 case for frank(), eddi() is still 3x faster.

    If we switch from a valset of characters to something that frank() must coerce to a character for its by-row paste0 operation, both eddi() and hannahh() beat it as nobs grows.
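
    For example (my illustration; the original doesn't show the alternative valset), an integer value pool forces that coercion:

    valset <- seq_len(nvals)  # integers: the by-row paste0() must coerce each value to character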


    Benchmarks for doing this repeatedly. This is probably obvious, but if you have to do this "many" times (...how many is hard to say), it's better to create the key column once than to go through the subsetting for each set of baduns. In the single-run simulation above, eddi() is about 5x as fast as frank(), so I'd switch to frank() only if I were doing this subsetting 10+ times.

    maxbadlen    <- 2
    set_o_baduns <- replicate(10, sample(valset, size = sample(maxbadlen, 1)),
                              simplify = FALSE)  # keep a list even if all draws happen to have equal length
    
    system.time({
        DT <- data.table(exdf)
        DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
        setkey(DT,vstr)
    
        for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
    })
    # user  system elapsed 
    # 0.29    0.00    0.29
    
    system.time({
        dt = as.data.table(exdf)
        for (i in 1:10) dt[, 
          unlist(vars), by = id][!V1 %in% set_o_baduns[[i]],
          .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    })
    # user  system elapsed 
    # 0.39    0.00    0.39
    
    system.time({
        for (i in 1:10) hannahh(exdf,set_o_baduns[[i]])
    })
    # user  system elapsed 
    # 4.10    0.00    4.13
    

    So, as expected, frank() takes very little time for additional evaluations, while eddi() and hannahh() grow linearly with the number of evaluations.

  • 2020-12-17 02:51

    Here's another idea:

    df %>% 
      rowwise() %>% 
      do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
      mutate(length = length(newcol)) %>%
      ungroup()
    

    Which gives:

    #  id    vars newcol length
    #1  1       a             0
    #2  2 a, b, c   b, c      2
    #3  3    b, c   b, c      2
    

    You could then filter on length > 0 to keep only the rows with a non-empty newcol:

    df %>% 
      rowwise() %>% 
      do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
      mutate(length = length(newcol)) %>%
      ungroup() %>%
      filter(length > 0)
    

    Which gives:

    #  id    vars newcol length
    #1  2 a, b, c   b, c      2
    #2  3    b, c   b, c      2
    

    Note: As mentioned by @Arun in the comments, this approach is quite slow. You are better off with the data.table solutions.
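
    As an aside, do() is superseded in recent dplyr; here is a sketch of the same row-wise idea under dplyr >= 1.0 (my adaptation, not part of the original answer):

    library(dplyr)
    df %>%
      rowwise() %>%                                  # inside mutate, vars is this row's own vector
      mutate(newcol = list(setdiff(vars, "a"))) %>%  # wrap in list() to keep a list column
      ungroup() %>%
      filter(lengths(newcol) > 0)                    # drop rows where nothing is left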
