R: data.table count !NA per row

予麋鹿 2020-12-16 14:24

I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.

Example data:

2 answers
  •  清歌不尽
    2020-12-16 14:41

    The two options that quickly come to mind are:

    d[, num_obs := sum(!is.na(.SD)), by = 1:nrow(d)][]
    d[, num_obs := rowSums(!is.na(d))][]
    

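    The first works by creating a "group" of just one row per group (1:nrow(d)), so .SD holds a single row each time and sum() counts the non-NA cells in that row. Without the grouping, it would count the non-NA values across the entire table and assign that one total to every row.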

    The second makes use of an already very efficient base R function, rowSums.
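
    To make the two options concrete, here is a minimal sketch on a tiny table. The data and the column names (a, b, c, n_rowsums, n_bygroup) are made up for illustration and are not from the question. It also passes the original columns through .SDcols, which is my own addition: with the versions above, re-running the line after num_obs exists would count num_obs itself as one more non-NA column.

    ```r
    library(data.table)

    # Hypothetical example data (not from the original question)
    d <- data.table(
      a = c(1, NA, 3),
      b = c(NA, NA, 6),
      c = c("x", "y", NA)
    )

    # Remember the original columns so the new count columns are never counted
    cols <- names(d)

    # Option 1: is.na() gives a logical matrix, rowSums() adds up each row
    d[, n_rowsums := rowSums(!is.na(.SD)), .SDcols = cols]

    # Option 2: one group per row, so .SD is a single row inside each group
    d[, n_bygroup := sum(!is.na(.SD)), by = seq_len(nrow(d)), .SDcols = cols]

    d[]
    ```

    Both new columns should hold 2, 1, 2 here: row 1 is missing only b, row 2 has only c, and row 3 is missing only c.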

    Here is a benchmark on larger data:

    library(data.table)
    set.seed(1)
    nrow = 10000
    ncol = 15
    d <- as.data.table(matrix(sample(c(NA, -5:10), nrow*ncol, TRUE), nrow = nrow, ncol = ncol))
    
    fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
    fun2 <- function(indt) indt[, num_obs := sum(!is.na(.SD)), by = 1:nrow(indt)][]
    
    library(microbenchmark)
    microbenchmark(fun1(copy(d)), fun2(copy(d)))
    # Unit: milliseconds
    #           expr        min         lq       mean     median         uq      max neval
    #  fun1(copy(d))   3.727958   3.906458   5.507632   4.159704   4.475201 106.5708   100
    #  fun2(copy(d)) 584.499120 655.634889 684.889614 681.054752 712.428684 861.1650   100
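
    Timings only mean something if both functions return the same answer, so a quick check along these lines may be worth running first. This is a sketch that assumes the same fun1 and fun2 as above; the smaller row count and the names n_rows, n_cols, res1 and res2 are my own choices to keep it fast.

    ```r
    library(data.table)

    set.seed(1)
    n_rows <- 1000
    n_cols <- 15
    d <- as.data.table(matrix(sample(c(NA, -5:10), n_rows * n_cols, TRUE),
                              nrow = n_rows, ncol = n_cols))

    fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
    fun2 <- function(indt) indt[, num_obs := sum(!is.na(.SD)), by = 1:nrow(indt)][]

    # copy() keeps := from modifying d itself by reference
    res1 <- fun1(copy(d))
    res2 <- fun2(copy(d))

    # rowSums() returns doubles and sum() of logicals returns integers,
    # so compare on a common type
    same_counts <- isTRUE(all.equal(res1$num_obs, as.numeric(res2$num_obs)))
    same_counts
    ```

    If same_counts is TRUE, the two approaches agree row by row and the benchmark is comparing like with like.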
    

    By the way, the trailing empty [] is just there to print the resulting data.table. It is needed because := (like the set* functions in "data.table") updates the table by reference and does not print the result on its own.
