Fastest way to filter a data.frame list column contents in R / Rcpp

后端 未结 3 1115
长发绾君心
长发绾君心 2020-12-17 02:02

I have a data.frame:

df <- structure(list(id = 1:3, vars = list(\"a\", c(\"a\", \"b\", \"c\"), c(\"b\", 
\"c\"))), .Names = c(\"id\", \"vars\"), row.names         


        
3条回答
  •  生来不讨喜
    2020-12-17 02:51

    Here's another way:

    # prep
    DT <- data.table(df)
    DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
    setkey(DT,vstr)
    
    get_badkeys <- function(x) 
      unlist(sapply(1:length(x),function(n) combn(sort(x),n,paste0,collapse="_")))
    
    # choose values to exclude
    baduns  <- c("a","b")
    
    # subset
    DT[!J(get_badkeys(baduns))]
    

    This is fairly fast, but it takes up your key.


    Benchmarks. Here's a made-up example:

    Candidates:

    hannahh <- function(df,baduns){
        df %>% 
            mutate(vars = lapply(.$vars, setdiff, baduns)) %>% 
            filter(!!sapply(vars,length))
    }
    eddi    <- function(df,baduns){
            dt = as.data.table(df)
            dt[, 
              unlist(vars)
            , by = id][!V1 %in% baduns, 
              .(vars = list(V1))
            , keyby = id][dt, nomatch = 0]
    }   
    stevenb <- function(df,baduns){
        df %>% 
          rowwise() %>% 
          do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
          mutate(length = length(newcol)) %>%
          ungroup() %>%
          filter(length > 0)
    }
    frank   <- function(df,baduns){
        DT <- data.table(df)
        DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
        setkey(DT,vstr)
        DT[!J(get_badkeys(baduns))]
    }
    

    Simulation:

    nvals  <- 4
    nbads  <- 2
    maxlen <- 4
    
    nobs   <- 1e4
    
    exdf   <- data.table(
      id=1:nobs,
      vars=replicate(nobs,list(sample(valset,sample(maxlen,1))))
    )
    setDF(exdf)
    baduns <- valset[1:nbads]
    

    Results:

    system.time(frank_res   <- frank(exdf,baduns))
    #   user  system elapsed 
    #   0.24    0.00    0.28 
    system.time(hannahh_res <- hannahh(exdf,baduns))
    #   0.42    0.00    0.42
    system.time(eddi_res    <- eddi(exdf,baduns))
    #   0.05    0.00    0.04
    system.time(stevenb_res <- stevenb(exdf,baduns))
    #   36.27   55.36   93.98
    

    Checks:

    identical(sort(frank_res$id),eddi_res$id) # TRUE
    identical(unlist(stevenb_res$id),eddi_res$id) # TRUE
    identical(unlist(hannahh_res$id),eddi_res$id) # TRUE
    

    Discussion:

    For eddi() and hannahh(), the results scarcely change with nvals, nbads and maxlen. In contrast, when baduns goes over 20, frank() becomes incredibly slow (like 20+ sec); it also scales up with nbads and maxlen a little worse than the other two.

    Scaling up nobs, eddi()'s lead over hannahh() stays the same, at about 10x. Against frank(), it sometimes shrinks and sometimes stays the same. In the best nobs = 1e5 case for frank(), eddi() is still 3x faster.

    If we switch from a valset of characters to something that frank() must coerce to a character for its by-row paste0 operation, both eddi() and hannahh() beat it as nobs grows.


    Benchmarks for doing this repeatedly. This is probably obvious, but if you have to do this "many" times (...how many is hard to say), it's better to create the key column than to go through the subsetting for each set of baduns. In the simulation above, eddi() is about 5x as fast as frank(), so I'd go for the latter if I was doing this subsetting 10+ times.

    maxbadlen    <- 2
    set_o_baduns <- replicate(10,sample(valset,size=sample(maxbadlen,1)))
    
    system.time({
        DT <- data.table(exdf)
        DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
        setkey(DT,vstr)
    
        for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
    })
    # user  system elapsed 
    # 0.29    0.00    0.29
    
    system.time({
        dt = as.data.table(exdf)
        for (i in 1:10) dt[, 
          unlist(vars), by = id][!V1 %in% set_o_baduns[[i]],
          .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    })
    # user  system elapsed 
    # 0.39    0.00    0.39
    
    system.time({
        for (i in 1:10) hannahh(exdf,set_o_baduns[[i]])
    })
    # user  system elapsed 
    # 4.10    0.00    4.13
    

    So, as expected, frank() takes very little time for additional evaluations, while eddi() and hannahh() grow linearly.

提交回复
热议问题