NA in data.table

Submitted by 狂风中的少年 on 2019-12-04 03:23:25

From ?NA

NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
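As a quick illustration of the passage above, the following base R sketch checks the storage type of each NA constant and shows that coercing the plain logical NA yields the typed equivalent. It only uses functions from base R.

```r
# Each NA constant has its own storage type.
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Coercing the logical NA produces the matching typed NA.
identical(as.integer(NA), NA_integer_)      # TRUE
identical(as.character(NA), NA_character_)  # TRUE
```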

You will have to specify the correct type for your function to work -

You can coerce within the function to match the type of x. Note that any() is needed for this to work when a subset has more than one row, because if() expects a single TRUE or FALSE.

f <- function(x) {if (any(x == 9)) {return(as(NA, class(x)))} else {return(x)}}
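A minimal sketch of how this function behaves on plain vectors, assuming the corrected form with parentheses around the if condition. The test vectors are made up for illustration; the point is that the returned NA takes the type of the input.

```r
# Same logic as f above, written out over several lines.
f <- function(x) {
  if (any(x == 9)) {
    as(NA, class(x))   # a length-1 NA of the same type as x
  } else {
    x
  }
}

typeof(f(c(9L, 9L)))   # "integer": NA_integer_ is returned
typeof(f(9))           # "double": NA_real_ is returned
f(c(1L, 2L))           # no 9 present, so x comes back unchanged
```

When the function is called per group, the length-1 NA is recycled across the rows of that group.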

A more data.table-ish approach

It might make more data.table sense to use set (or :=) to set / replace by reference.

set(dtb, i = which(dtb[,a]==9), j = 'a', value=NA_integer_)

Or := within [ using a vector scan for a==9

dtb[a == 9, a := NA_integer_]

Or := along with a binary search

setkeyv(dtb, 'a')
dtb[J(9), a := NA_integer_] 

Useful to note

If you use the := or set approaches, you don't appear to need to specify the NA type

Both the following will work

dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
dtb[a==9,a := NA]

dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
set(dtb, which(dtb[,a] == 9), 'a', NA)

By contrast, combining := with a binary search and an untyped NA (DTc[J(9), a := NA], the call quoted in the message) gives a very useful error message that states both the reason and the solution:

Error in `[.data.table`(DTc, J(9), `:=`(a, NA)) : Type of RHS ('logical') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
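One way to avoid depending on whether a given data.table version coerces the right-hand side is to build the NA from the column's own class, reusing the as(NA, class(x)) idea from earlier in this answer. This is a sketch under that assumption, not the only way to do it; the table and the key value 9L are illustrative.

```r
library(data.table)

dtb <- data.table(a = 1:10)
setkeyv(dtb, "a")

# Binary search on the key; the NA is coerced to the class of column a,
# so the RHS type matches the LHS type explicitly.
dtb[J(9L), a := as(NA, class(a))]

typeof(dtb$a)       # the column keeps its integer type
which(is.na(dtb$a)) # position of the replaced value
```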


Which is quickest?

With a reasonably large data set, where a is replaced in situ:

Replace in situ

library(data.table)

set.seed(1)
n <- 1e+07
DT <- data.table(a = sample(15, n, replace = TRUE))
setkeyv(DT, "a")
DTa <- copy(DT)
DTb <- copy(DT)
DTc <- copy(DT)
DTd <- copy(DT)
DTe <- copy(DT)

f <- function(x) {
    if (any(x == 9)) {
        return(as(NA, class(x)))
    } else {
        return(x)
    }
}

system.time({DT[a == 9, `:=`(a, NA_integer_)]})
##    user  system elapsed 
##    0.95    0.24    1.20 
system.time({DTa[a == 9, `:=`(a, NA)]})
##    user  system elapsed 
##    0.74    0.17    1.00 
system.time({DTb[J(9), `:=`(a, NA_integer_)]})
##    user  system elapsed 
##    0.02    0.00    0.02 
system.time({set(DTc, which(DTc[, a] == 9), j = "a", value = NA)})
##    user  system elapsed 
##    0.49    0.22    0.67 
system.time({set(DTd, which(DTd[, a] == 9), j = "a", value = NA_integer_)})
##    user  system elapsed 
##    0.54    0.06    0.58 
system.time({DTe[, `:=`(a, f(a)), by = a]})
##    user  system elapsed 
##    0.53    0.12    0.66 
# They are all the same!
all(identical(DT, DTa), identical(DT, DTb), identical(DT, DTc), identical(DT, 
    DTd), identical(DT, DTe))
## [1] TRUE

Unsurprisingly, the binary search approach is the fastest.

You can also do something like this:

dtb <- data.table(a=1:10)

mat <- ifelse(dtb == 9,NA,dtb$a)

The above command will give you a matrix, but you can convert it back to a data.table:

new.dtb <- data.table(mat)
new.dtb
     a
 1:   1
 2:   2
 3:   3
 4:   4
 5:   5
 6:   6
 7:   7
 8:   8
 9:  NA
10:  10

Hope this helps.

sdaza

If you want to assign NAs to many variables, you could use the approach suggested here:

v_1  <- c(0,0,1,2,3,4,4,99)
v_2  <- c(1,2,2,2,3,99,1,0)
dat  <-  data.table(v_1,v_2)

for(n in 1:2) {
  chari <-  paste0(sprintf('v_%s' ,n), ' %in% c(0,99)')
  charj <- sprintf('v_%s := NA_integer_', n)
  dat[eval(parse(text=chari)), eval(parse(text=charj))]
}
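As an alternative sketch that avoids eval(parse(...)), the same replacement can be written with set() in a loop over the column names. This is not the linked approach, just one possible variant; the columns here are created as doubles, so NA_real_ is used to match their type.

```r
library(data.table)

dat <- data.table(v_1 = c(0, 0, 1, 2, 3, 4, 4, 99),
                  v_2 = c(1, 2, 2, 2, 3, 99, 1, 0))

for (col in c("v_1", "v_2")) {
  # Rows where this column holds one of the codes to blank out.
  rows <- which(dat[[col]] %in% c(0, 99))
  # Replace by reference; NA_real_ matches the double columns.
  set(dat, i = rows, j = col, value = NA_real_)
}

dat
```

Passing the column name as a string to j keeps the loop free of code built from text, which tends to be easier to read and debug.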