How to replace NA values in a data.table with na.spline

问题

I'm trying to prepare some demographic data retrieved from Eurostat for further processing, amongst others replacing any missing data with corresponding approximated ones.

First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.

One thing I've observed while doing so was getting different results when using "na.spline" in combination with "apply" versus "na.spline" as part of the data.table.

#1 source data

(dt0 <- data.table(
            "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
            "2014" = c(NA,    NA,      NA,      4351253, NA),
            "2013" = c(37408, NA,      NA,      4328238, 4707690),
            "2012" = c(38252, NA,      1684000, 4309977, 4651601),
            "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
            "2010" = c(40296, NA,      1673656, 4285442, 4542083)
        ))

Which generates

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD      NA   37408   38252   38252   40296
    2:        TOTAL,F,AL      NA      NA      NA 1409931      NA
    3:        TOTAL,F,AM      NA      NA 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ      NA 4707690 4651601 4594023 4542083

Split into "row headings"...

(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))

...and population data...

(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0ba <- colnames(dt0b))

#2 replace NAs by "na.spline" & "apply"

(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))

Which results in...

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD   32832   37408   38252   38252   40296
    2:        TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
    3:        TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083

#3 replace NAs within "data.table"

(dt2b <- dt0b[,lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))

Et voila...

       age,sex,geo\\time    2014    2013      2012    2011      2010
    1:        TOTAL,F,AD 4351253   37408   38252.0   38252   40296.0
    2:        TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
    3:        TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
    4:        TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
    5:        TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0

#4 compare results

(identical(dt1, dt2))

Considering the aforementioned not quite a surprise...

    [1] FALSE

(The values calculated as a replacement for NAs with approach #2 are the ones I'd be interested in, only generated via approach #3).

Reason for pursuing the "data.table" route (approach #3) is one of performance (as it has been pointed out in various posts, when using "apply" a matrix operation is getting carried out, which is taking a considerably longer time than a corresponding approach facilitating "data.table" only).

Being pretty new to R I reckon I've done something quite fundamentally wrong, only thing is, I haven't got the faintest idea what this might be.

Any help pointing me in the right direction is more than appreciated!

-Sil68

回答1:

Using a matrix. Using a matrix operation on a matrix is not slow:

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
mat[]         <- na.spline(mat,na.rm=FALSE)

which gives

     TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
2014      32832    1409931    1692440    4351253    4755163
2013      37408    1409931    1688458    4328238    4707690
2012      38252    1409931    1684000    4309977    4651601
2011      38252    1409931    1679066    4296293    4594023
2010      40296    1409931    1673656    4285442    4542083

Using a data.table. If you instead want to use a data.table, do

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT            <- data.table(mat,keep.rownames=TRUE)
DT[,(vn):=lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn]

which updates DT by reference, giving

     rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
1: 2014      32832    1409931    1692440    4351253    4755163
2: 2013      37408    1409931    1688458    4328238    4707690
3: 2012      38252    1409931    1684000    4309977    4651601
4: 2011      38252    1409931    1679066    4296293    4594023
5: 2010      40296    1409931    1673656    4285442    4542083

Benchmarking:

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT            <- data.table(mat,keep.rownames=TRUE)
vn            <- names(DT)[-1]
tvn           <- names(dt0)[-1]

require(microbenchmark)
microbenchmark(
  transp = dt0[,as.list(na.spline(unlist(.SD), na.rm=FALSE)),by=1:nrow(dt0),.SDcols=tvn],
  lapply = DT[,lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn],
  apply  = apply(mat,2,na.spline,na.rm=FALSE),
  fun    = na.spline(mat,na.rm=FALSE),
times=10)

Results:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957    10
 lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446    10
  apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801    10
    fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846    10

The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply is not necessary here, thanks to how na.spline works.

回答2:

You are doing rowwise operations in your apply version and colwise operations in your data.table version. You can do the rowwise operation in data.table if you set by = 1:nrow(dt).

dt2b <- dt0b[, as.list(na.spline(unlist(.SD), na.rm=FALSE)), by = 1:nrow(dt0b)]

You can also use .SDcols, so that you don't need to split up the data. IF the age,sex,geo\\time is unique, the following will work:

dt0[, as.list(na.spline(unlist(.SD), na.rm=FALSE)), by = `age,sex,geo\\time`, .SDcols = -"age,sex,geo\\time"]
##    age,sex,geo\\time      V1      V2      V3      V4      V5
## 1:        TOTAL,F,AD   32832   37408   38252   38252   40296
## 2:        TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
## 3:        TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
## 4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
## 5:        TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083

来源：https://stackoverflow.com/questions/30075752/how-to-replace-na-values-in-a-data-table-with-na-spline

标签

data.table

apply

spline