R: data.table count !NA per row


I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.

Example data:

li
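The example data is cut off in this copy of the question. As a stand-in for trying the answers below, a small table can be built along these lines. The column names and values are made up for illustration and are not the asker's original data.

```r
library(data.table)

# Hypothetical stand-in data: 4 columns with NAs scattered across rows.
d <- data.table(
  a = c(1, 2, NA, 4),
  b = c(NA, 2, NA, 4),
  c = c(1, NA, NA, NA),
  e = c(1, 2, NA, NA)
)

# Goal: a new column holding the number of non-NA values in each row.
# For this table the per-row counts are 3, 3, 0 and 2.
```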


                      
2 Answers

一个人的身影
2020-12-16 14:38

Try this one using Reduce to chain together + calls:

d[, num_obs := Reduce(`+`, lapply(.SD, function(x) !is.na(x)))]
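
To see why this works: lapply(.SD, ...) yields one logical vector per column, and Reduce adds those vectors element-wise, so position i of the result is the non-NA count for row i. A minimal sketch on a made-up table (the name `toy` and its values are hypothetical):

```r
library(data.table)

toy <- data.table(x = c(1, NA, 3), y = c(NA, NA, 6), z = c(7, 8, 9))

# One logical vector per column: TRUE where the value is present.
present <- lapply(toy, function(col) !is.na(col))

# Reduce(`+`, list(p1, p2, p3)) computes (p1 + p2) + p3 element-wise;
# the addition coerces TRUE/FALSE to 1/0.
counts <- Reduce(`+`, present)
print(counts)  # 2 1 3

# The same expression inside j, where .SD stands for all columns of toy.
toy[, num_obs := Reduce(`+`, lapply(.SD, function(col) !is.na(col)))]
```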


If speed is critical, you can eke out a touch more with Ananda's suggestion to hardcode the number of columns being assessed (4 in this case):

d[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))]
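
A hardcoded count gives wrong results without any warning if the table gains or loses columns, including the num_obs column itself when the line is run a second time. One possible variation, which is not part of the original answer, is to name the columns through .SDcols and take the count from length(.SD). The table `toy` and the vector `value_cols` below are hypothetical:

```r
library(data.table)

toy <- data.table(x = c(1, NA, 3), y = c(NA, NA, 6), z = c(7, 8, 9))

# Assumption for this sketch: the value columns are known up front, so
# num_obs is never counted as one of the inputs.
value_cols <- c("x", "y", "z")

toy[, num_obs := length(.SD) - Reduce(`+`, lapply(.SD, is.na)),
    .SDcols = value_cols]

# A second run gives the same answer, because .SDcols excludes the
# num_obs column that now exists.
toy[, num_obs := length(.SD) - Reduce(`+`, lapply(.SD, is.na)),
    .SDcols = value_cols]
print(toy$num_obs)  # 2 1 3
```

This trades a little of the speed gained by hardcoding for not having to keep the constant in sync with the table.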


Benchmarking on the larger data.table d from Ananda's answer (10,000 rows by 15 columns, built in the other answer on this page):

fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD, function(x) !is.na(x)))][]
# The benchmark table has 15 columns, so the hardcoded count must be 15 here, not 4.
fun4 <- function(indt) indt[, num_obs := 15 - Reduce("+", lapply(.SD, is.na))][]

library(microbenchmark)
microbenchmark(fun1(copy(d)), fun3(copy(d)), fun4(copy(d)), times=10L)

#Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
# fun1(copy(d)) 3.565866 3.639361 3.912554 3.703091 4.023724 4.596130    10
# fun3(copy(d)) 2.543878 2.611745 2.973861 2.664550 3.657239 4.011475    10
# fun4(copy(d)) 2.265786 2.293927 2.798597 2.345242 3.385437 4.128339    10

                                                                        
                                                        
            
            
              
                
清歌不尽
2020-12-16 14:41

The two options that quickly come to mind are:

d[, num_obs := sum(!is.na(.SD)), by = 1:nrow(d)][]
d[, num_obs := rowSums(!is.na(d))][]


The first works by creating a "group" of just one row per group (1:nrow(d)). Without that, it would count the non-NA values across the entire table and put that single total in every row.
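
The difference between the grouped and ungrouped calls can be seen on a small made-up table (the name `toy` and its values are hypothetical):

```r
library(data.table)

toy <- data.table(x = c(1, NA, 3), y = c(NA, NA, 6), z = c(7, 8, 9))

# Without `by`, .SD is the whole table, so sum() returns one total
# that is recycled into every row.
whole <- copy(toy)[, num_obs := sum(!is.na(.SD))]
print(whole$num_obs)    # 6 6 6

# With one group per row, .SD is a single-row table each time.
per_row <- copy(toy)[, num_obs := sum(!is.na(.SD)), by = 1:nrow(toy)]
print(per_row$num_obs)  # 2 1 3
```

Evaluating j once per row is also what makes this approach slow on large tables.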

The second makes use of an already very efficient base R function, rowSums.
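
As a rough illustration of the second option: is.na() on a data.table returns a logical matrix of the same shape, and rowSums() adds up each row of it, treating TRUE as 1. The table `toy` below is hypothetical, and the .SDcols restriction is an addition of this sketch rather than part of the answer:

```r
library(data.table)

toy <- data.table(x = c(1, NA, 3), y = c(NA, NA, 6), z = c(7, 8, 9))

present <- !is.na(toy)   # logical matrix, one cell per table cell
print(dim(present))      # 3 3

# rowSums() returns a double vector; the counts here are 2, 1 and 3.
row_counts <- rowSums(present)

# Inside j, .SDcols keeps the count limited to the chosen columns.
toy[, num_obs := rowSums(!is.na(.SD)), .SDcols = c("x", "y", "z")]
```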

Here is a benchmark on larger data:

library(data.table)
set.seed(1)
n_rows <- 10000
n_cols <- 15
d <- as.data.table(matrix(sample(c(NA, -5:10), n_rows * n_cols, TRUE), nrow = n_rows, ncol = n_cols))

fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun2 <- function(indt) indt[, num_obs := sum(!is.na(.SD)), by = 1:nrow(indt)][]

library(microbenchmark)
microbenchmark(fun1(copy(d)), fun2(copy(d)))
# Unit: milliseconds
#           expr        min         lq       mean     median         uq      max neval
#  fun1(copy(d))   3.727958   3.906458   5.507632   4.159704   4.475201 106.5708   100
#  fun2(copy(d)) 584.499120 655.634889 684.889614 681.054752 712.428684 861.1650   100




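By the way, the empty [] is just there to print the resulting data.table. In "data.table", := and the set* functions update the table by reference and return it invisibly, so the trailing [] is needed whenever you want the updated table shown or returned visibly.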
                                                                        
                                                        
            
            
              
                