I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.
Example data:
Try this one, using Reduce to chain together + calls:
d[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))]
If speed is critical, you can eke out a touch more with Ananda's suggestion to hardcode the number of columns being assessed:
d[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))]
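To see what the Reduce call is doing, here is a minimal sketch on a plain data.frame, so it runs without any packages. The toy data and the names toy, present, num_obs and num_obs_alt are made up for illustration. The same idea should carry over to .SD inside d[, ...], since .SD is also a list of columns.

```r
# Hypothetical toy data: 3 rows, 3 columns, with some NAs.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# One logical vector per column: TRUE where the value is present.
present <- lapply(toy, function(x) !is.na(x))

# Reduce adds the vectors pairwise, (a + b) + c, element by element,
# so each position ends up holding the non-NA count for that row.
num_obs <- Reduce(`+`, present)

# The "subtract from the column count" variant. ncol(toy) is used here
# instead of a hardcoded number.
num_obs_alt <- ncol(toy) - Reduce(`+`, lapply(toy, is.na))

print(num_obs)      # 2 1 3
print(num_obs_alt)  # 2 1 3
```

Because everything stays as whole-column vector arithmetic, there is no conversion of the table to a matrix, which is probably where the small gain over rowSums comes from.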
Benchmarking using the larger data.table d from Ananda's answer:
fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))][]
fun4 <- function(indt) indt[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))][]
library(microbenchmark)
microbenchmark(fun1(copy(d)), fun3(copy(d)), fun4(copy(d)), times=10L)
# Unit: milliseconds
#           expr      min       lq     mean   median       uq      max neval
#  fun1(copy(d)) 3.565866 3.639361 3.912554 3.703091 4.023724 4.596130    10
#  fun3(copy(d)) 2.543878 2.611745 2.973861 2.664550 3.657239 4.011475    10
#  fun4(copy(d)) 2.265786 2.293927 2.798597 2.345242 3.385437 4.128339    10
The two options that quickly come to mind are:
d[, num_obs := sum(!is.na(.SD)), by = 1:nrow(d)][]
d[, num_obs := rowSums(!is.na(d))][]
The first works by creating a "group" of just one row per group (1:nrow(d)). Without that, it would just count the non-NA values across the entire table and assign that single total to every row.
The second makes use of an already very efficient base R function, rowSums.
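The difference between the whole-table count and the per-row count can be sketched in base R. The toy data and variable names below are made up for illustration. The sapply loop only mimics what grouping by 1:nrow(d) does (one small call per row) and is not how data.table implements it.

```r
# Hypothetical toy data: 3 rows, 3 columns, with some NAs.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# No grouping: a single count for the entire table.
total <- sum(!is.na(toy))

# One "group" per row: sum() is called once for every row.
per_row_loop <- sapply(seq_len(nrow(toy)), function(i) sum(!is.na(toy[i, ])))

# rowSums does the same per-row count in a single vectorised call.
per_row <- rowSums(!is.na(toy))

print(total)         # 6
print(per_row_loop)  # 2 1 3
```

Calling a function once per row is the likely reason the grouped version is so much slower on larger data.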
Here is a benchmark on larger data:
library(data.table)
set.seed(1)
nrow = 10000
ncol = 15
d <- as.data.table(matrix(sample(c(NA, -5:10), nrow*ncol, TRUE), nrow = nrow, ncol = ncol))
fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun2 <- function(indt) indt[, num_obs := sum(!is.na(.SD)), by = 1:nrow(indt)][]
library(microbenchmark)
microbenchmark(fun1(copy(d)), fun2(copy(d)))
# Unit: milliseconds
#           expr        min         lq       mean     median         uq      max neval
#  fun1(copy(d))   3.727958   3.906458   5.507632   4.159704   4.475201 106.5708   100
#  fun2(copy(d)) 584.499120 655.634889 684.889614 681.054752 712.428684 861.1650   100
By the way, the empty [] is just there to print the resulting data.table. It is needed when you want to see the output of := or the set* functions in "data.table", because they return their result invisibly.