How to get outliers for all the columns in a data frame in R

旧时难觅i 2020-12-19 21:56

I am working on a generic function that takes a data frame, returns all the outliers for every variable in it, and then removes them.

 outliers <-          


        
1 answer
  • 2020-12-19 22:08

    We create a function that selects only the numeric columns (select_if), loops over those columns (map), and keeps the elements that boxplot.stats does not flag as outliers. The output is a list of vectors, one per numeric column, and the vectors can differ in length.

    library(dplyr)
    library(purrr)

    outlierremoval <- function(dataframe) {
      dataframe %>%
        select_if(is.numeric) %>%  # keep only the numeric columns
        # per column, drop the values that boxplot.stats() flags as outliers
        map(~ .x[!.x %in% boxplot.stats(.x)$out])
        # It is not clear whether the output should be a list or a data.frame.
        # For a data.frame, the columns can end up with different lengths,
        # so they would need padding, e.g. with rowr::cbind.fill:
        # %>% { do.call(rowr::cbind.fill, c(., list(fill = NA))) }
    }
    
    outlierremoval(Clean_Data)
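
    To make the filtering rule concrete, here is a minimal base R sketch on a single toy vector (the data are made up for illustration). It shows what boxplot.stats(x)$out contains and applies the same subset that the function above applies to each column. With the default coef = 1.5, a value is flagged when it lies more than 1.5 times the hinge spread beyond the hinges.

```r
# Toy vector (hypothetical data): 100 sits far away from the other values
x <- c(10, 12, 11, 13, 12, 100)

stats <- boxplot.stats(x)   # default coef = 1.5
stats$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                   # values flagged as outliers

# The same subset that outlierremoval() applies to each numeric column
kept <- x[!x %in% stats$out]
kept
```

    The coef argument of boxplot.stats controls how strict the rule is; a larger value flags fewer points.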
    

    If we want to keep all the other columns as well, use map_if and pad the shortened columns with NA at the end using cbind.fill to get a data.frame. Note that this shifts values upward within each column by the number of outliers removed above them, so the rows no longer line up across columns.

    outlierremoval <- function(dataframe) {
      dataframe %>%
        map_if(is.numeric, ~ .x[!.x %in% boxplot.stats(.x)$out]) %>%
        { do.call(rowr::cbind.fill, c(., list(fill = NA))) } %>%
        set_names(names(dataframe))
    }
    res <- outlierremoval(Clean_Data)
    head(res)
    #  X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup      Parking City_Category Rainfall House_Price
    #1 1           1      9796        5250         10703   1659    1961         Open         CAT B      530     6649000
    #2 2           2      8294        8186         12694   1461    1752 Not Provided         CAT B      210     3982000
    #3 3           3     11001       14399         16991   1340    1609 Not Provided         CAT A      720     5401000
    #4 4           4      8301       11188         12289   1451    1748      Covered         CAT B      620     5373000
    #5 5           5     10510       12629         13921   1770    2111 Not Provided         CAT B      450     4662000
    #6 6           6      6665        5142          9972   1442    1733         Open         CAT B      760     4526000
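
    The row shift mentioned above is easy to miss in a large data set, so here is a small base R sketch that reproduces it. The data frame and the helper names drop_outliers and pad are hypothetical; pad is only a stand-in for the NA padding that cbind.fill performs, not its actual implementation.

```r
# Toy data (hypothetical): row 3 holds an outlier in column a
df <- data.frame(
  id = 1:5,
  a  = c(1, 2, 100, 3, 4),
  b  = c(10, 11, 12, 13, 14)
)

# Drop the values that boxplot.stats() flags as outliers
drop_outliers <- function(x) x[!x %in% boxplot.stats(x)$out]

# Pad a vector with NA up to length n (stand-in for cbind.fill's padding)
pad <- function(x, n) c(x, rep(NA, n - length(x)))

cols <- lapply(df, function(col) if (is.numeric(col)) drop_outliers(col) else col)
n    <- max(lengths(cols))
res  <- as.data.frame(lapply(cols, pad, n = n))

res   # in row 3, a is now 3, a value that originally belonged to id 4
```

    After the outlier is removed, the remaining values of a move up and the NA lands in the last row, so each row of res mixes values from different original observations.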
    

    Update

    If we need the outliers themselves, extract the out element of boxplot.stats in the map step. The result is a named list with one vector of outliers per numeric column.

    outliers <- function(dataframe) {
      dataframe %>%
        select_if(is.numeric) %>%
        map(~ boxplot.stats(.x)$out)
    }
    outliers(Clean_Data)
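
    Columns without outliers come back as empty vectors, which can clutter the result for wide data. A minimal base R sketch of the same extraction on hypothetical toy data, followed by an optional step that keeps only the columns that actually have outliers:

```r
# Toy data (hypothetical): only column a contains an outlier
df <- data.frame(
  a = c(1, 2, 100, 3, 4),
  b = c(10, 11, 12, 13, 14),
  g = c("u", "v", "w", "x", "y")
)

# Base R equivalent of select_if(is.numeric) followed by map()
out <- lapply(Filter(is.numeric, df), function(x) boxplot.stats(x)$out)
out

# Optional: keep only the columns where something was flagged
out_nonempty <- out[lengths(out) > 0]
out_nonempty
```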
    

    Or, to replace the outliers with NA (which keeps every value in its original row):

    outlierreplacement <- function(dataframe) {
      dataframe %>%
        map_if(is.numeric, ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>%
        bind_cols()
    }
    outlierreplacement(Clean_Data)
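
    Since the question also asks to remove the outliers, one possible follow-up is to drop every row that ended up with an NA. Below is a base R sketch under that assumption, on hypothetical toy data; na_outliers is an illustrative helper name. Keep in mind that complete.cases also drops rows whose NA values were already present before the replacement.

```r
# Toy data (hypothetical): row 3 holds an outlier in column a
df <- data.frame(
  id = 1:5,
  a  = c(1, 2, 100, 3, 4),
  b  = c(10, 11, 12, 13, 14)
)

# Replace flagged values with NA; positions are unchanged
na_outliers <- function(x) replace(x, x %in% boxplot.stats(x)$out, NA)

num     <- vapply(df, is.numeric, logical(1))   # which columns are numeric
df[num] <- lapply(df[num], na_outliers)
df

# Drop the rows that contain at least one NA
cleaned <- df[complete.cases(df), ]
cleaned
```

    Numeric identifier columns (such as X and Observation in Clean_Data) are scanned like any other numeric column, so it may be worth excluding them before applying the replacement.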
    