Extract last non-missing value in row with data.table

前端 未结 5 898
囚心锁ツ
囚心锁ツ 2021-01-04 04:03

I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It\'s kindof a typical max.col situation, bu

5条回答
  •  南笙
    南笙 (楼主)
    2021-01-04 04:42

    Here's another way:

    dat[, res := NA_character_]
    for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]
    
    
       X1 X2 X3 X4 X5 res
    1:  u NA NA NA NA   u
    2:  f  q NA NA NA   q
    3:  f  b  w NA NA   w
    4:  k  g  h NA NA   h
    5:  u  b  r NA NA   r
    6:  f  q  w  x  t   t
    7:  u  g  h  i  e   e
    8:  u  q  r  n  t   t
    

    Benchmarks Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.

    Functions:

    alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
        if(!length(wh)) return(ans)
        ans[wh] = as.character(x[[length(x)]])[wh]
        Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
    }   
    
    alex2 = function(x){
        x[, res := NA_character_]
        wh = x[, .I]
        for (v in (length(x)-1):1){
          if (!length(wh)) break
          set(x, j="res", i=wh, v = x[[v]][wh])
          wh = wh[is.na(x$res[wh])]
        }
        x$res
    }
    
    frank = function(x){
        x[, res := NA_character_]
        for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
        return(x$res)       
    }
    
    frank2 = function(x){
        x[, res := NA_character_]
        for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
        x$res
    }
    

    Example data and benchmark:

    DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                         function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
    DAT2 = copy(DAT1)
    DAT3 = as.list(copy(DAT1))
    DAT4 = copy(DAT1)
    
    library(microbenchmark)
    microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)
    
    Unit: milliseconds
             expr       min        lq      mean    median         uq        max neval
      frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898    30
     frank2(DAT2)  88.68229  93.40476 118.27959 107.69190  121.60257  346.48264    30
       alex(DAT3)  98.56861 109.36653 131.21195 131.20760  149.99347  183.43918    30
      alex2(DAT4)  26.14104  26.45840  30.79294  26.67951   31.24136   50.66723    30
    

提交回复
热议问题