I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to create a for loop to substitute the
First, you need to write a mode function that ignores the missing values of the categorical data, which here are encoded as empty strings (zero characters long).
The mode function:
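getmode <- function(v) {
  # drop empty strings, which mark missing categorical values
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  # count each unique value and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}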
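As a quick sanity check, the helper can be tried on small vectors in which the empty string stands in for a missing value. The sample vectors x_chr and x_fct are made up for illustration, and getmode is repeated only so the snippet runs on its own.

```r
# Copy of the mode helper so this snippet is self-contained.
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]   # drop empty strings (the "missing" marker)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Made-up sample data: "" marks a missing entry.
x_chr <- c("A", "", "B", "A", "")
x_fct <- factor(c("BB", "", "CC", "BB"))

getmode(x_chr)                 # "A"
as.character(getmode(x_fct))   # "BB"
```

When two values tie, which.max returns the first maximum, so the value that appears first in the data wins.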
Then you can iterate over the columns: if a column is numeric, fill its missing values with the mean; otherwise, fill them with the mode.
The loop statement below:
for (cols in colnames(df)) {
  col_sym <- rlang::sym(cols)  # column name as a symbol for tidy evaluation
  if (is.numeric(df[[cols]])) {
    # numeric column: replace NA with the column mean
    df <- df %>% mutate(!!cols := replace(!!col_sym, is.na(!!col_sym), mean(!!col_sym, na.rm = TRUE)))
  } else {
    # non-numeric column: replace empty strings with the column mode
    df <- df %>% mutate(!!cols := replace(!!col_sym, !!col_sym == "", getmode(!!col_sym)))
  }
}
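The same logic can also be written without tidy evaluation. The following is a minimal base R sketch, assuming (as in the rest of this answer) that missing numeric values are NA and missing categorical values are empty strings. The helper name impute_df and the toy data frame are hypothetical and only serve the illustration.

```r
# Copy of the mode helper so this snippet is self-contained.
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical helper: mean for numeric columns, mode for all other columns.
impute_df <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      # numeric column: NA -> column mean
      x[is.na(x)] <- mean(x, na.rm = TRUE)
    } else {
      # other columns: "" -> most frequent non-empty value
      empty <- !is.na(x) & as.character(x) == ""
      x[empty] <- getmode(x)
    }
    df[[col]] <- x
  }
  df
}

# Made-up toy data with one missing value per column.
toy <- data.frame(
  num = c(1, NA, 3),
  cat = factor(c("a", "", "a"))
)
impute_df(toy)
```

Plain `[[` indexing avoids the `!!` and `:=` operators, at the cost of not being usable inside a dplyr pipeline step. A column made up entirely of empty strings has no mode and is not handled by this sketch.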
Let's provide an example:
library(tidyverse)
df <- tibble(
  id = seq(1, 10),
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
  ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
)
df
The initial df with the missing values:
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23
By running the for loop above, we get:
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20
 3     3     8   A       CC         18
 4     4     7   A       BB         22
 5     5    11.6 A       BB         18
 6     6    11.6 B       CC         17
 7     7    20   A       AA         19
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17
10    10    11.6 A       AA         23
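As we can see, the missing values have been imputed: the NA entries in the numeric columns were replaced by the column mean, and the empty strings in the factor columns by the column mode.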
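A result like this can also be checked in code by counting, per column, the entries that are still NA or an empty string. This is only a sketch: the helper name count_missing and the two toy data frames are hypothetical and stand in for a data frame before and after imputation.

```r
# Hypothetical helper: number of NA or "" entries in each column.
count_missing <- function(df) {
  sapply(df, function(x) sum(is.na(x) | as.character(x) == ""))
}

# Made-up data: one missing value per column before, none after.
before <- data.frame(num = c(1, NA, 3), cat = c("a", "", "a"))
after  <- data.frame(num = c(1, 2, 3),  cat = c("a", "a", "a"))

count_missing(before)   # num = 1, cat = 1
count_missing(after)    # num = 0, cat = 0
```

A vector of zeros means that no column contains either kind of missing marker any more.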