Interpolate NA values in a data frame with na.approx

Posted by 余生长醉 on 2019-11-27 12:10:05

A small, reproducible example:

library(zoo)
set.seed(1)
m <- matrix(runif(16, 0, 100), nrow = 4)
missing_values <- sample(16, 7)
m[missing_values] <- NA
m
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239       NA  6.178627 38.41037
[3,]       NA       NA        NA       NA
[4,] 90.82078 66.07978        NA       NA

na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA       NA
[4,] 90.82078 66.07978        NA       NA

m[4, 4] <- 50
na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA 44.20519
[4,] 90.82078 66.07978        NA 50.00000

Yup, looks like you do need the start/end values of columns to be known or the interpolation doesn't work. Can you guess values for your boundaries?

ANOTHER EDIT: So by default, you need the start and end values of columns to be known. However, it is possible to get na.approx to always fill in the blanks by passing rule = 2; see Felix's answer. You can also use na.fill to provide a default value, as per Gabor's comment. Finally, you can interpolate boundary conditions in two directions (see below) or guess boundary conditions.
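
As a rough sketch of the na.fill route: interpolate first, keep the unfilled ends, then replace whatever NAs remain with a default. The vector x and the default of 50 are made up for illustration, and na.rm = FALSE is passed so that na.approx keeps leading and trailing NAs instead of trimming them.

```r
library(zoo)

# Hypothetical 1-D example: NAs at both ends and one in the interior
x <- c(NA, 1, NA, 3, NA)

# Interpolate the interior only; na.rm = FALSE keeps the boundary NAs in place
interp <- na.approx(x, na.rm = FALSE)

# Replace every remaining NA with an assumed default value of 50
filled <- na.fill(interp, 50)
filled
```

The interior gap gets a proper interpolated value, and only the ends that na.approx could not reach fall back to the default.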


EDIT: A further thought. Since na.approx only interpolates down columns, and your data is spatial, perhaps interpolating along rows would be useful too. Then you could take the average of the two.

na.approx fails when whole columns are NA, so we create a bigger dataset.

set.seed(1)
m <- matrix(runif(64, 0, 100), nrow = 8)
missing_values <- sample(64, 15)
m[missing_values] <- NA

Run na.approx both ways.

by_col <- na.approx(m)
by_row <- t(na.approx(t(m)))

Find out the best guess.

default <- 50
best_guess <- ifelse(is.na(by_row), 
  ifelse(
    is.na(by_col), 
    default,              #neither known
    by_col                #only by_col known
  ), 
  ifelse(
    is.na(by_col), 
    by_row,               #only by_row known
    (by_row + by_col) / 2 #both known
  )
)
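
The nested ifelse can also be written as a cell-wise mean over the two interpolations, which may be easier to extend. This is only a sketch: it rebuilds the data with the same calls as above, passes na.rm = FALSE to na.approx so both results keep the 8 x 8 shape, and the names both and best_guess2 are made up here.

```r
library(zoo)

# Same setup as above
set.seed(1)
m <- matrix(runif(64, 0, 100), nrow = 8)
m[sample(64, 15)] <- NA

# Interpolate down columns and along rows, keeping any leftover NAs
by_col <- na.approx(m, na.rm = FALSE)
by_row <- t(na.approx(t(m), na.rm = FALSE))

# Stack the two results into an 8 x 8 x 2 array and average each cell,
# ignoring whichever estimate is missing
both <- array(c(by_col, by_row), dim = c(dim(m), 2))
best_guess2 <- apply(both, c(1, 2), mean, na.rm = TRUE)

# Cells where neither direction gave a value come out as NaN; use the default
default <- 50
best_guess2[is.nan(best_guess2)] <- default
```

Cells that were known in m are untouched by either interpolation, so their mean is just the original value.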

na.approx() follows the approx() function in only interpolating values, not extrapolating them, by default. However, as described in the help page for approx(), you can specify rule = 2 to extrapolate with a constant, namely the value at the nearest end of the known data. Following on from Richie Cotton's example (using the original 4 x 4 m, before m[4, 4] was set to 50):

na.approx(m, rule = 2)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592  6.178627 38.41037
[4,] 90.82078 66.07978  6.178627 38.41037
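
To see what rule does at each end, here is a small sketch on a made-up vector. The length-2 form of rule comes from approx(), whose help page notes that the left and right sides can be treated differently. na.rm = FALSE is added so that any NAs left over stay visible rather than being trimmed.

```r
library(zoo)

# Hypothetical vector with a leading NA, an interior NA and a trailing NA
x <- c(NA, 1, NA, 3, NA)

# Default (rule = 1): only the interior gap is filled
inside_only <- na.approx(x, na.rm = FALSE)             # NA 1 2 3 NA

# rule = 2: both ends are filled with the nearest known value
both_ends <- na.approx(x, rule = 2, na.rm = FALSE)     # 1 1 2 3 3

# rule = 2:1: extend on the left only, leave the right end as NA
left_only <- na.approx(x, rule = 2:1, na.rm = FALSE)   # 1 1 2 3 NA
```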

Equivalently, you can use "last observation carried forward" (LOCF) explicitly to fill the trailing NAs.

na.locf(na.approx(m))
## and "next observation carried backward" for leading NAs too:
na.locf(na.locf(na.approx(m)), fromLast = TRUE)
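
As a sanity check of that equivalence, the sketch below compares the two approaches on a made-up vector. na.rm = FALSE is passed throughout (an addition here, not part of the calls above) so that nothing is trimmed along the way.

```r
library(zoo)

# Hypothetical vector with NAs at both ends and in the middle
x <- c(NA, 1, NA, 3, NA)

# Step 1: interpolate the interior only
interp <- na.approx(x, na.rm = FALSE)                        # NA 1 2 3 NA

# Step 2: carry the last known value forward into the trailing NA
forward <- na.locf(interp, na.rm = FALSE)                    # NA 1 2 3 3

# Step 3: carry the first known value backward into the leading NA
filled <- na.locf(forward, fromLast = TRUE, na.rm = FALSE)   # 1 1 2 3 3

# Same result as constant extrapolation with rule = 2
all.equal(filled, na.approx(x, rule = 2))
```
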
Henrik

I think you should try to set na.rm=TRUE

From the docs

na.rm logical. Should leading NAs be removed?

http://www.oga-lab.net/RGM2/func.php?rd_id=zoo:na.approx
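
For what it's worth, a quick check on a made-up vector suggests that na.rm controls whether leftover NAs at the ends are dropped, not whether they get filled. This sketch assumes a reasonably recent zoo, where na.rm = TRUE appears to be the default for na.approx.

```r
library(zoo)

# Hypothetical vector with a leading and a trailing NA
x <- c(NA, 1, NA, 3, NA)

# na.rm = TRUE: the boundary NAs are trimmed, so the result is shorter
trimmed <- na.approx(x, na.rm = TRUE)   # 1 2 3

# na.rm = FALSE: the boundary NAs are kept, so the length is preserved
kept <- na.approx(x, na.rm = FALSE)     # NA 1 2 3 NA

c(length(trimmed), length(kept))
```

So if the goal is to fill the ends while keeping the original length, rule = 2 or na.fill looks like a better fit than na.rm.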
