Fastest way to filter a data.frame list column contents in R / Rcpp

长发绾君心 2020-12-17 02:02

I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", 
"c"))), .Names = c("id", "vars"), row.names

3 Answers
  •  不思量自难忘°
    2020-12-17 02:47

    Setting aside any algorithmic improvements, the analogous data.table solution should be faster, because := adds a column by reference and so avoids copying the whole table just to add a column:

    library(data.table)
    dt = as.data.table(df)  # or use setDT to convert in place
    
    # drop 'a' from every element, then keep rows whose result is non-empty
    dt[, newcol := lapply(vars, setdiff, 'a')][lengths(newcol) != 0]
    #   id  vars newcol
    #1:  2 a,b,c    b,c
    #2:  3   b,c    b,c
    

    You can also delete the original column (at essentially zero cost) by adding [, vars := NULL] at the end. Or, if you don't need the original values, you can simply overwrite the initial column, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
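
As a rough sketch of the overwrite variant described above, the following rebuilds the sample data from the visible part of the question's structure() call (an assumption, since the original call is cut off) and filters the list column in place. The name res is only illustrative.

```r
library(data.table)

# Sample data rebuilt from the visible part of the question's structure() call
# (assumption: the truncated remainder only sets row names and the class)
dt <- data.table(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c")))

# Overwrite the list column by reference: remove "a" from every element
dt[, vars := lapply(vars, setdiff, "a")]

# Keep only the rows whose list element is still non-empty
res <- dt[lengths(vars) > 0]
print(res)
```

Because the column is overwritten, the original vars values are lost; use a new column name instead if they are still needed.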


    Now, as far as algorithmic improvements go, assuming each row has a unique id (if not, add a unique identifier column), I think this is much faster, and it takes care of the filtering automatically:

    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
    #   id vars
    #1:  2  b,c
    #2:  3  b,c
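
To make the chained expression easier to follow, here is a sketch that splits it into three steps. The intermediate names long, kept and out are hypothetical, and the sample data is rebuilt from the visible part of the question.

```r
library(data.table)

# Sample data rebuilt from the visible part of the question (assumption)
dt <- data.table(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c")))

# Step 1: unnest the list column into long format, one row per (id, element)
long <- dt[, .(V1 = unlist(vars)), by = id]

# Step 2: drop the unwanted values with an ordinary vectorised filter
kept <- long[!V1 %in% "a"]

# Step 3: collapse the remaining values back into one list element per id
out <- kept[, .(vars = list(V1)), by = id]
print(out)
```

An id whose values are all removed in step 2 has no rows left to collapse in step 3, which is why the empty rows disappear without a separate length check.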
    

    To carry along the other columns, I think it's easiest to simply merge back:

    dt[, othercol := 5:7]
    
    # notice the keyby: it keys the result on id so it can be joined with dt
    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    #   id vars i.vars othercol
    #1:  2  b,c  a,b,c        6
    #2:  3  b,c    b,c        7
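
If the duplicated i.vars column is not wanted, one possible variation (not part of the original answer) is to select only the columns to carry along before joining, and to name the join column with on= instead of relying on a key. The names filtered and res are hypothetical.

```r
library(data.table)

# Sample data rebuilt from the visible part of the question (assumption),
# plus the extra column used in the answer
dt <- data.table(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c")))
dt[, othercol := 5:7]

# Filtered list column, built with the unlist / filter / re-list approach
filtered <- dt[, .(V1 = unlist(vars)), by = id][!V1 %in% "a", .(vars = list(V1)), by = id]

# Join the carried-along columns onto the filtered ids;
# the old vars column is left out, so no i.vars column appears
res <- dt[, .(id, othercol)][filtered, on = "id"]
print(res)
```

Only ids present in filtered survive the join, so the rows that became empty are dropped here as well.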
    
