I want to replace the NAs in a column of a data.table with the mean of the same column. I am doing the following, but it is not working.
ww <- data.ta
na.aggregate in the zoo package replaces NAs with the mean of the non-NAs in the same column:
library(zoo)
ww[, Sepal.Length := na.aggregate(Sepal.Length)]
In base R:
ww$Sepal.Length[is.na(ww$Sepal.Length)] <- mean(ww$Sepal.Length, na.rm = TRUE)
While the zoo answer is pretty nice, it requires a new dependency.
Using just data.table, you could do the following.
library(data.table)
# prepare data
ww = data.table(iris[1:5,])
ww[1, Sepal.Length := NA]
# solution
ww[, Sepal.Length.mean := mean(Sepal.Length, na.rm = TRUE) # calculate mean
][is.na(Sepal.Length), Sepal.Length := Sepal.Length.mean # replace NA with mean
][, Sepal.Length.mean := NULL # remove mean col
][] # just prints
While it may look bulky compared to zoo's, it is efficient because every step updates by reference using :=.
It can also be easily adapted to replace NA with the mean by group, just by using the by argument in data.table.
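As a minimal sketch of the by-group variant mentioned above: the table dt and its columns grp and x are made up for illustration and are not from the question. The only change from the ungrouped version is the by argument on the step that computes the mean.

```r
library(data.table)

# Hypothetical example data: two groups, one missing value in each
dt <- data.table(
  grp = c("a", "a", "a", "b", "b", "b"),
  x   = c(1, NA, 3, 10, 20, NA)
)

# Compute the mean per group, fill only the NA rows, then drop the helper column
dt[, x_mean := mean(x, na.rm = TRUE), by = grp
  ][is.na(x), x := x_mean
  ][, x_mean := NULL]
```

Each NA is filled with the mean of its own group rather than the overall mean, and all three steps still update dt by reference.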
Your attempt subsetted the table first, selecting only the rows where Sepal.Length is NA:
> ww[is.na(Sepal.Length)]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:           NA         3.5          1.4         0.2  setosa
so any further operations can only 'see' these rows, i.e. Sepal.Length can only see that one NA, and the mean of no remaining values is NaN.
The data.table solution you want is below: it looks at the whole table and replaces the NAs with the mean using ifelse.
ww[, Sepal.Length := ifelse(is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE), Sepal.Length)]
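To make the subsetting issue concrete, here is a small sketch that computes the mean both ways on the same five-row example used elsewhere in this thread. The names subset_mean and full_mean are only illustrative.

```r
library(data.table)

ww <- data.table(iris[1:5, ])
ww[1, Sepal.Length := NA]

# With the subset in i, j only sees the NA row, so there is nothing to average
subset_mean <- ww[is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE)]  # NaN

# Without the subset in i, j sees the whole column
full_mean <- ww[, mean(Sepal.Length, na.rm = TRUE)]  # 4.8
```

This is why the mean has to be computed outside the subset, either without any filter in i (as with ifelse) or by storing it first.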
It is not taking the mean of the entire Sepal.Length column, only of the rows that you have selected.
Rather use:
ww[is.na(Sepal.Length), Sepal.Length := mean(ww$Sepal.Length, na.rm = TRUE)]
tidyr has a built-in function, replace_na, that you can use for this:
library(tidyr)
ww %>% replace_na(list(Sepal.Length = mean(.$Sepal.Length, na.rm = TRUE)))