Replacing all missing values in R data.table with a value

清歌不尽 2020-12-04 14:05

If you have an R data.table with missing values, how do you replace all of them with, say, the value 0? E.g.

aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,         


        
4 answers
  •  暖寄归人
    2020-12-04 14:38

    is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.

    Using <- to assign results in a copy of all the columns, and it is not the idiomatic way to do this with data.table.

    First I'll show how to do it, and then show how slow the <- approach can get on huge data (due to the copy):

    One way to do this efficiently:

    for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
    

    For a character column, you'll get a warning here that 0 is being coerced to character to match the type of the column. You can ignore it.
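
    As a small self-contained illustration of the loop above, here is a minimal sketch on a toy table. The table dt and its columns a and b are made up for this example and are not from the original post; both columns are numeric, so no coercion to character is involved.

    ```r
    library(data.table)

    # Hypothetical toy table (not from the original post):
    # one integer column and one double column, each with missing values
    dt <- data.table(a = c(1L, NA, 3L), b = c(NA, 2.5, NA))

    for (j in seq_along(dt)) {
      # row indices where column j is NA
      na_rows <- which(is.na(dt[[j]]))
      # overwrite only those rows, in place, without copying the table
      set(dt, i = na_rows, j = j, value = 0)
    }

    print(dt)
    ```

    Passing the row indices from which() means set only touches the cells that were NA; columns without any NA get an empty index vector and are left alone.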

    Why shouldn't you use <- here:

    # by reference - idiomatic way
    set.seed(45)
    tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
    tracemem(tt)
    # modifies value by reference - no copy
    system.time({
    for (i in seq_along(tt)) 
        set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
    })
    #   user  system elapsed 
    #  0.284   0.083   0.386 
    
    # by copy - NOT the idiomatic way
    set.seed(45)
    tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
    tracemem(tt)
    # makes copy
    system.time({tt[is.na(tt)] <- 0})
    # a bunch of "tracemem" output showing the copies being made
    #   user  system elapsed 
    #  4.110   0.976   5.187 
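
    The difference between the two approaches can also be seen without timings. The following is a rough sketch on a hypothetical one-column table (the names dt, ref and copied are illustrative only): set changes the object that every name points to, while the <- form works on a copy and leaves the original untouched.

    ```r
    library(data.table)

    # Hypothetical toy table, not from the original post
    dt  <- data.table(x = c(1, NA, 3))
    ref <- dt                       # second name for the same object, no copy made

    # `[<-` builds a modified copy and rebinds `copied`; dt keeps its NA
    copied <- dt
    copied[is.na(copied)] <- 0
    stopifnot(any(is.na(dt$x)), !any(is.na(copied$x)))

    # set() updates the table in place, so `ref` sees the change as well
    set(dt, i = which(is.na(dt[["x"]])), j = "x", value = 0)
    stopifnot(!any(is.na(dt$x)), !any(is.na(ref$x)))
    ```

    If you need an untouched original when using set, take an explicit copy(dt) first, since plain assignment such as ref <- dt does not copy a data.table.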
    
