I am working on a generic function that takes a data frame, returns all the outliers for every variable in it, and then removes them.
outliers <-
We create a function that selects only the numeric columns (select_if), loops through those columns (map), and keeps the elements that are not outliers. The output is a list of vectors.
library(dplyr)
library(tidyr)
library(purrr)
outlierremoval <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>% # selects only the numeric columns
    map(~ .x[!.x %in% boxplot.stats(.x)$out]) #%>%
    # not clear whether we need to output as a list or data.frame
    # if it is the latter, the columns could be of different length
    # so we may use cbind.fill
    # { do.call(rowr::cbind.fill, c(., list(fill = NA)))}
}
outlierremoval(Clean_Data)
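To see what boxplot.stats flags, here is a minimal sketch on a single made-up vector (x is illustrative, not taken from Clean_Data). With the default coef = 1.5, values lying more than 1.5 times the hinge spread beyond the hinges end up in $out.

```r
# Toy vector, made up for illustration
x <- c(10, 12, 11, 13, 12, 95)

# Values beyond the whiskers are reported in $out
out <- boxplot.stats(x)$out
out
# [1] 95

# Keep only the elements that are not flagged as outliers
kept <- x[!x %in% out]
kept
# [1] 10 12 11 13 12
```

Note that the hinges come from fivenum, so on small samples the cut-offs can differ slightly from ones based on quantile().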
If we want to keep all the other columns, use map_if and pad each column with NA at the end using cbind.fill to create a data.frame output. Note that this shifts the row positions within each column, depending on how many outliers were removed from it.
outlierremoval <- function(dataframe){
  dataframe %>%
    map_if(is.numeric, ~ .x[!.x %in% boxplot.stats(.x)$out]) %>%
    { do.call(rowr::cbind.fill, c(., list(fill = NA)))} %>%
    set_names(names(dataframe))
}
res <- outlierremoval(Clean_Data)
head(res)
# X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
#1 1 1 9796 5250 10703 1659 1961 Open CAT B 530 6649000
#2 2 2 8294 8186 12694 1461 1752 Not Provided CAT B 210 3982000
#3 3 3 11001 14399 16991 1340 1609 Not Provided CAT A 720 5401000
#4 4 4 8301 11188 12289 1451 1748 Covered CAT B 620 5373000
#5 5 5 10510 12629 13921 1770 2111 Not Provided CAT B 450 4662000
#6 6 6 6665 5142 9972 1442 1733 Open CAT B 760 4526000
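The row shifting is easier to see on a toy example. The sketch below is an illustration only: the data frame toy is made up, and base R padding with `length<-` is used as a stand-in for rowr::cbind.fill, on the assumption that both simply append NA at the end of shorter columns.

```r
# Toy data, made up for illustration: 100 is an outlier in column a
toy <- data.frame(
  a = c(1, 2, 100, 3, 4, 5),
  b = c(10, 11, 12, 13, 14, 15)
)

# Drop the outliers from each column separately
kept <- lapply(toy, function(x) x[!x %in% boxplot.stats(x)$out])

# Pad the shorter columns with NA at the end (stand-in for cbind.fill)
n_max <- max(lengths(kept))
padded <- as.data.frame(lapply(kept, function(x) {
  length(x) <- n_max
  x
}))

padded
# Row 3 now pairs a = 3 with b = 12, although in toy
# the value 3 sat in row 4 next to b = 13
```

So after padding, a row no longer describes a single observation, which matters if the columns are analysed together afterwards.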
If we need to get the outliers themselves, extract them (the out element of boxplot.stats) in the map step:
outliers <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%
    map(~ boxplot.stats(.x)$out)
}
outliers(Clean_Data)
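If the row positions of the outliers are also of interest, a variation along these lines may help. This is a base R sketch on a made-up data frame (toy and outlier_rows are hypothetical names, not part of the code above); it returns the indices instead of the values.

```r
# Toy data, made up for illustration
toy <- data.frame(
  a = c(1, 2, 100, 3, 4, 5),
  b = c(10, 11, 12, 13, 14, 15),
  label = c("p", "q", "r", "s", "t", "u")
)

# Hypothetical helper: row indices of the outliers in each numeric column
outlier_rows <- function(dataframe) {
  numeric_cols <- Filter(is.numeric, dataframe)
  lapply(numeric_cols, function(x) which(x %in% boxplot.stats(x)$out))
}

idx <- outlier_rows(toy)
idx$a
# [1] 3
lengths(idx)  # number of outliers per numeric column
```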
Or, to replace the outliers with NA (which also preserves the row positions):
outlierreplacement <- function(dataframe){
  dataframe %>%
    map_if(is.numeric, ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>%
    bind_cols()
}
outlierreplacement(Clean_Data)
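Because the NA version keeps rows aligned, whole observations can then be dropped in one step if that is what "remove" should mean. The following is a base R sketch on made-up data (toy and na_outliers are hypothetical names); whether dropping a full row for one outlying value is appropriate depends on the analysis.

```r
# Toy data, made up for illustration
toy <- data.frame(
  a = c(1, 2, 100, 3, 4, 5),
  b = c(10, 11, 12, 13, 14, 15),
  label = c("p", "q", "r", "s", "t", "u")
)

# Hypothetical helper: NA out the outliers, leave non-numeric columns alone
na_outliers <- function(x) {
  if (!is.numeric(x)) return(x)
  replace(x, x %in% boxplot.stats(x)$out, NA)
}

replaced <- toy
replaced[] <- lapply(toy, na_outliers)  # same shape, rows stay aligned

# Drop every row that contains at least one NA
clean <- replaced[complete.cases(replaced), ]
nrow(clean)
# [1] 5
```

Note that complete.cases also drops rows that had missing values before the replacement, so it does not distinguish original NAs from flagged outliers.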