R: data.table count !NA per row

予麋鹿 2020-12-16 14:24

I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.

Example data:

2 Answers
  • 2020-12-16 14:38

    Try this one using Reduce to chain together + calls:

    d[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))]
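To make the Reduce step easier to follow, here is a minimal sketch on a small made-up table (the values are hypothetical, not the question's data). It assumes only that the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table: 3 rows, 3 columns
d <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# Step 1: one logical vector per column, TRUE where a value is present
present <- lapply(d, function(x) !is.na(x))

# Step 2: Reduce adds those vectors element-wise, i.e. (a + b) + c,
# so each position ends up holding the non-NA count for that row
counts <- Reduce(`+`, present)

# The same computation inside data.table, stored as a new column
d[, num_obs := Reduce(`+`, lapply(.SD, function(x) !is.na(x)))]
print(d)
```

Because the addition is vectorised over whole columns, no per-row loop or grouping is involved.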
    

    If speed is critical, you can eke out a little more with Ananda's suggestion to hardcode the number of columns being assessed:

    d[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))]
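The hardcoded constant has to match the number of columns being counted. As a possible variant (my own suggestion, not part of the answer), `length(.SD)` can stand in for the constant; this sketch uses a hypothetical four-column table and checks the result against rowSums.

```r
library(data.table)

# Hypothetical toy table with 4 columns
d <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6),
                c = c(7, 8, 9),  e = c(NA, 1, 2))

# Reference result computed before num_obs is added
expected <- rowSums(!is.na(d))

# length(.SD) is the number of columns in .SD, so there is no constant
# to keep in sync with the table
d[, num_obs := length(.SD) - Reduce(`+`, lapply(.SD, is.na))]
print(d)
```

Note that running the assignment a second time would include num_obs itself in .SD, so restrict the columns with .SDcols if the table may already contain it.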
    

    Benchmarking using Ananda's larger data.table d (10,000 rows by 15 columns, defined in the other answer):

    fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
    fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))][]
    # The constant must equal the number of columns being counted; the benchmark table d has 15
    fun4 <- function(indt) indt[, num_obs := 15 - Reduce("+", lapply(.SD, is.na))][]
    
    library(microbenchmark)
    microbenchmark(fun1(copy(d)), fun3(copy(d)), fun4(copy(d)), times=10L)
    
    #Unit: milliseconds
    #          expr      min       lq     mean   median       uq      max neval
    # fun1(copy(d)) 3.565866 3.639361 3.912554 3.703091 4.023724 4.596130    10
    # fun3(copy(d)) 2.543878 2.611745 2.973861 2.664550 3.657239 4.011475    10
    # fun4(copy(d)) 2.265786 2.293927 2.798597 2.345242 3.385437 4.128339    10
    
  • 2020-12-16 14:41

    The two options that quickly come to mind are:

    d[, num_obs := sum(!is.na(.SD)), by = 1:nrow(d)][]
    d[, num_obs := rowSums(!is.na(d))][]
    

    The first works by creating a "group" of just one row per group (1:nrow(d)). Without that, it would just count the non-NA values across the entire table.

    The second makes use of an already very efficient base R function, rowSums.
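The difference between the grouped and ungrouped calls can be seen in a small sketch. The table below is hypothetical, not the question's data.

```r
library(data.table)

# Hypothetical toy table: 6 of the 9 cells are non-NA
d <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# Without `by`, .SD is the whole table, so the result is one grand total
whole <- d[, sum(!is.na(.SD))]

# With one group per row, .SD is a single row and the sum is per row
per_row <- d[, sum(!is.na(.SD)), by = 1:nrow(d)]$V1

# rowSums works on the logical matrix directly, with no grouping at all
via_rowsums <- rowSums(!is.na(d))

print(whole)
print(per_row)
print(via_rowsums)
```

The grouped form evaluates j once per row, which is consistent with it being far slower in the benchmark that follows.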

    Here is a benchmark on larger data:

    set.seed(1)
    nrow = 10000
    ncol = 15
    d <- as.data.table(matrix(sample(c(NA, -5:10), nrow*ncol, TRUE), nrow = nrow, ncol = ncol))
    
    fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
    fun2 <- function(indt) indt[, num_obs := sum(!is.na(.SD)), by = 1:nrow(indt)][]
    
    library(microbenchmark)
    microbenchmark(fun1(copy(d)), fun2(copy(d)))
    # Unit: milliseconds
    #           expr        min         lq       mean     median         uq      max neval
    #  fun1(copy(d))   3.727958   3.906458   5.507632   4.159704   4.475201 106.5708   100
    #  fun2(copy(d)) 584.499120 655.634889 684.889614 681.054752 712.428684 861.1650   100
    

    By the way, the empty [] is just there to print the resulting data.table. It is needed when you want to display the output of := or the set* functions in "data.table", which modify the table by reference without printing it.
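A short sketch of that behaviour, on a hypothetical table. The remark about copy() is my reading of why the benchmarks wrap the input in copy(d); the answer does not say so explicitly.

```r
library(data.table)

# Hypothetical toy table (not the question's data)
d <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6))

# `:=` adds the column to d by reference; at the console this prints nothing
d[, num_obs := rowSums(!is.na(d))]

# Appending [] makes the same kind of call display the updated table
d[, has_all := num_obs == 2][]

# copy() makes an independent table, so changes to d2 do not touch d;
# presumably this is why each benchmark run starts from copy(d)
d2 <- copy(d)
d2[, extra := 1L]
```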
