Replace missing values with the mean or mode in R

终归单人心 2020-12-18 13:09

I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to create a for loop to substitute the

2 Answers
  •  失恋的感觉
    2020-12-18 13:50

    First, write a mode function that skips the missing values of the categorical data, which appear here as empty strings (length < 1).
    The mode function:

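    # Return the most frequent value of v, ignoring NA and empty strings
    getmode <- function(v) {
      v <- v[!is.na(v) & nchar(as.character(v)) > 0]
      uniqv <- unique(v)
      # Count how often each unique value occurs and return the most common one
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }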
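
    As a quick check, the sketch below calls getmode on a small factor and on a character vector. The function is restated here (including an is.na() guard) so the snippet runs on its own; the sample vectors `x` and `y` are made up for illustration and are not from the question.

    ```r
    # getmode restated so this snippet is self-contained
    getmode <- function(v) {
      v <- v[!is.na(v) & nchar(as.character(v)) > 0]
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }

    # Hypothetical sample data: "" marks a missing category
    x <- factor(c("A", "B", "A", "", "B", "A", ""))
    mode_x <- as.character(getmode(x))
    print(mode_x)  # "A" occurs three times, "B" twice

    # Hypothetical character vector where NA is the most frequent entry
    y <- c("x", NA, NA, NA, "x", "y")
    mode_y <- getmode(y)
    print(mode_y)  # NA entries are dropped before counting, so "x" wins
    ```

    When two values tie, which.max returns the first one encountered, so the result depends on the order of the data.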
    

    Then iterate over the columns: if a column is numeric, fill its missing values with the mean; otherwise, fill them with the mode.

    The loop statement below:

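    # Requires dplyr (for %>% and mutate); library(tidyverse) attaches it.
    for (cols in colnames(df)) {
      if (is.numeric(df[[cols]])) {
        # Numeric column: replace NA with the column mean
        df <- df %>% mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm = TRUE)))
      } else {
        # Non-numeric column: replace empty strings with the column mode
        df <- df %>% mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols) == "", getmode(!!rlang::sym(cols))))
      }
    }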
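
    For comparison, the same idea can be sketched in base R without dplyr. This is an alternative technique, not the loop from the answer: `impute_column` is a hypothetical helper name, the small data frame is invented for illustration, and getmode is restated (with an is.na() guard) so the snippet runs on its own.

    ```r
    # getmode restated so this snippet is self-contained
    getmode <- function(v) {
      v <- v[!is.na(v) & nchar(as.character(v)) > 0]
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }

    # Hypothetical helper: impute one column according to its type
    impute_column <- function(x) {
      if (is.numeric(x)) {
        # Numeric column: NA becomes the mean of the observed values
        x[is.na(x)] <- mean(x, na.rm = TRUE)
      } else {
        # Other columns: empty strings become the most frequent value
        blank <- !is.na(x) & as.character(x) == ""
        x[blank] <- as.character(getmode(x))
      }
      x
    }

    # Made-up example data
    small_df <- data.frame(
      score = c(1, NA, 3),
      grade = factor(c("A", "", "A"))
    )

    # Assigning into small_df[] keeps the data frame structure
    small_df[] <- lapply(small_df, impute_column)
    print(small_df)
    ```

    Applying one function per column keeps the type check and the replacement in a single place, at the cost of not using the tidyverse pipeline shown above.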
    

    Let's provide an example:

    library(tidyverse)
    
    df<-tibble(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
               ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
               ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
               ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
               )
    
    df
    

    The initial df with the missing values:

    # A tibble: 10 x 5
          id ColumnA ColumnB ColumnC ColumnD
       <int>   <dbl> <fct>   <fct>     <dbl>
     1     1      10 "A"     ""           NA
     2     2       9 "B"     "BB"         20
     3     3       8 "A"     "CC"         18
     4     4       7 "A"     "BB"         22
     5     5      NA ""      "BB"         18
     6     6      NA "B"     "CC"         17
     7     7      20 "A"     "AA"         19
     8     8      15 "B"     "BB"         NA
     9     9      12 ""      ""           17
    10    10      NA "A"     "AA"         23
    

    By running the for loop above, we get:

    # A tibble: 10 x 5
          id ColumnA ColumnB ColumnC ColumnD
       <dbl>   <dbl> <fct>   <fct>     <dbl>
     1     1    10   A       BB         19.2
     2     2     9   B       BB         20  
     3     3     8   A       CC         18  
     4     4     7   A       BB         22  
     5     5    11.6 A       BB         18  
     6     6    11.6 B       CC         17  
     7     7    20   A       AA         19  
     8     8    15   B       BB         19.2
     9     9    12   A       BB         17  
    10    10    11.6 A       AA         23 
    

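    As we can see, the missing values have been imputed: each NA in ColumnA became the column mean (81/7 ≈ 11.6), each NA in ColumnD became 19.25 (displayed as 19.2), and the empty entries in ColumnB and ColumnC became the modes A and BB.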
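
    The imputed values can also be checked by hand. The sketch below recomputes the means and modes directly from the example columns in base R; the variable names are chosen here for illustration.

    ```r
    # Example columns copied from the tibble above
    column_a <- c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA)
    column_d <- c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
    column_b <- factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A"))
    column_c <- factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA"))

    # Means of the observed (non-NA) values
    mean_a <- mean(column_a, na.rm = TRUE)  # 81 / 7
    mean_d <- mean(column_d, na.rm = TRUE)  # 154 / 8

    # Most frequent non-empty category in each factor
    mode_b <- names(which.max(table(column_b[column_b != ""])))
    mode_c <- names(which.max(table(column_c[column_c != ""])))

    print(round(mean_a, 1))  # 11.6
    print(mean_d)            # 19.25
    print(mode_b)            # "A"
    print(mode_c)            # "BB"
    ```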
