ignore NA in dplyr row sum

自闭症患者 2020-11-27 17:27

Is there an elegant way to treat NA as 0 (as with na.rm = TRUE) when summing across rows in dplyr?

data <- data.frame(a=c(1,2,3,4), b=c(4,NA,5,6), c=c(7,8,9,NA))

data %>% mutate(sum

6 Answers
  •  离开以前
    2020-11-27 17:45

    Here's a similar approach to Steven's, but includes dplyr::select() to explicitly state which columns to include/ignore (like ID variables).

    data %>% 
      mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE))
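
To make the effect of `select()` concrete, here is a minimal sketch using the question's toy data plus an `id` column. The `id` column and its values are hypothetical and added only for illustration. A character column like this would make `rowSums()` fail if the whole data frame were passed in, which is the case the explicit column list guards against.

```r
library(dplyr)

# Toy data from the question, plus a hypothetical `id` column (an assumption,
# not in the original post) that should stay out of the sum.
data <- data.frame(
  id = c("w", "x", "y", "z"),
  a  = c(1, 2, 3, 4),
  b  = c(4, NA, 5, 6),
  c  = c(7, 8, 9, NA)
)

# Sum only columns a, b and c. With na.rm = TRUE the NAs are dropped,
# which for a sum is equivalent to counting them as 0.
result <- data %>%
  mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE))

print(unname(result$sum))
# [1] 12 10 17 10
```

Rows 2 and 4 each contain one NA, so their totals come from the two remaining values (2 + 8 and 4 + 6).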
    

    Its performance is comparable on a realistically sized dataset. I'm not sure why, though, since no columns are actually excluded in this skinny example.

    Bigger dataset of 1M rows:

    pick <- function() { sample(c(1:5, NA), 1000000, replace = TRUE) }
    data <- data.frame(a = pick(), b = pick(), c = pick())
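
The results table has the shape of `microbenchmark` output, so a comparison along these lines could be set up roughly as follows. This is a sketch that assumes the `microbenchmark` package is installed. The `whole_frame` expression, the two labels, and the fixed seed are illustrative assumptions, not the code behind any row in the table.

```r
library(dplyr)
library(microbenchmark)

set.seed(1)  # assumption: fixed seed so repeated runs draw the same data

n <- 1000000
pick <- function() sample(c(1:5, NA), n, replace = TRUE)
data <- data.frame(a = pick(), b = pick(), c = pick())

# Two candidate expressions, each evaluated 10 times.
# `selected` is the select()-based approach from this answer;
# `whole_frame` is a hypothetical comparison that sums every column.
timings <- microbenchmark(
  selected    = data %>% mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE)),
  whole_frame = data %>% mutate(sum = rowSums(., na.rm = TRUE)),
  times = 10
)

print(timings)
```

Timings depend on hardware and package versions, so absolute numbers from a run like this should not be expected to match the table.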
    

    Results:

    Unit: milliseconds
         expr         min          lq        mean      median          uq         max neval cld
       steven    22.05847    22.96164    56.84822    28.85411    54.99691   174.58447    10 a  
    wibeasley    25.10274    26.98303    30.66911    29.30630    30.63343    49.46048    10 a  
          lyz 10408.89904 10548.33756 10887.51930 10720.92372 11017.56256 12250.41370    10   c
          nar  1975.35941  2011.36445  2123.81705  2090.43174  2172.80501  2362.13658    10  b 
        akrun    31.27247    35.41943    81.33320    57.93900    63.59119   302.21059    10 a  
        frank    37.48265    38.72270    65.02965    41.62735    44.45775   261.79898    10 a  
    
