Identify outliers in a dataframe in R

Submitted by 做~自己de王妃 on 2020-01-06 06:32:49

Question


My data frame contains only numeric values. I am currently identifying outliers one column at a time. Can I identify the outliers in all columns at once and remove them in one go? Right now I replace the outlying values with NA.

My Code:

    quantiles<-tapply(var1,names,quantile) 
    minq <- sapply(names, function(x) quantiles[[x]]["25%"])
    maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
    var1[var1<minq | var1>maxq] <- NA

Data.

Data posted by the OP in a comment in dput format.

df1 <-
structure(list(Var1 = c(100.2, 110, 200, 456, 120000), 
var2 = c(NA, 4545L, 45465L, 44422L, 250000L), 
var3 = c(NA, 210000L, 91500L, 215000L, 250000L), 
var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)), 
class = "data.frame", row.names = c(NA, -5L))

Answer 1:


The following removes the outliers (here, values outside the range between the 1st and 3rd quartiles) from each column of the data frame. The result is a list, not a data frame, because the resulting vectors are not all of the same length.

df2 <- lapply(df1, function(x){
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[!is.na(x) & qq[1] <= x & x <= qq[2]]
})
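
To see why the result cannot be a data frame, one can compare how many values survive in each column. This is a minimal sketch, not part of the original answer; it rebuilds df1 from the posted data so that it runs on its own.

```r
# Rebuild the example data from the question (same values as the dput output)
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# Keep only the non-missing values between the 1st and 3rd quartiles
df2 <- lapply(df1, function(x) {
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[!is.na(x) & qq[1] <= x & x <= qq[2]]
})

# Number of values kept per column; the counts differ between columns
lengths(df2)
# Var1 var2 var3 var4
#    3    2    2    3
```

Because the columns end up with different lengths, they cannot be put back into a rectangular data frame without padding them with NA.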

Edit

In response to a follow-up question by the same user, @user11368874, the code below builds on the first snippet and answers that second question: it keeps the shape of the data frame and replaces the outliers with NA.

df3 <- df1
df3[] <- lapply(df1, function(x){
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  is.na(x) <-  x < qq[1] | x > qq[2]
  x
})

df3
#  Var1  var2   var3  var4
#1   NA    NA     NA 0.983
#2  110    NA 210000    NA
#3  200 45465     NA 0.983
#4  456 44422 215000 0.780
#5   NA    NA     NA    NA
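
If the goal is only to identify the outliers in every column at once, without changing the data yet, the same quartile test can return a logical matrix instead. This is a sketch under that assumption; the name is_out is illustrative and does not come from the original answer.

```r
# Rebuild the example data from the question (same values as the dput output)
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# is_out (hypothetical name): TRUE where a non-missing value lies below the
# 1st quartile or above the 3rd quartile of its own column
is_out <- sapply(df1, function(x) {
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  !is.na(x) & (x < qq[1] | x > qq[2])
})

colSums(is_out)                 # number of flagged values per column
which(is_out, arr.ind = TRUE)   # row and column index of each flagged value
```

The mask has the same dimensions as df1, so it can later be used either to set the flagged cells to NA or to drop the affected rows.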



Answer 2:


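The following function tests which values in each column fall below the lower or above the upper quantile given in q (by default the 1st and 3rd quartiles). Note that this is stricter than Tukey's fences, which lie 1.5 times the interquartile range beyond the quartiles. Depending on the out argument, the function then either removes every row that contains at least one outlier (out = TRUE, the default) or replaces the outliers with NA (out = FALSE).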

outlier.out <- function(dat, q = c(0.25, 0.75), out = TRUE){
    # logical matrix that records which cells are outliers
    tests <- matrix(NA, ncol = ncol(dat), nrow = nrow(dat))
    # flag outliers column by column; isTRUE() turns the NA produced by
    # comparing a missing value into FALSE, so existing NAs are never flagged
    for(i in seq_len(ncol(dat))){
        qq <- quantile(dat[, i], q, na.rm = TRUE)
        tests[, i] <- sapply(dat[, i] < qq[1] | dat[, i] > qq[2], isTRUE)
    }
    if(out){
        # remove every row that contains at least one outlier
        dat <- dat[!apply(tests, 1, FUN = any, na.rm = TRUE) ,]
    } else {
        # replace the outliers with NA
        dat[tests] <- NA
    }
    return(dat)
}

outlier.out(df1)
#   Var1  var2   var3 var4
# 4  456 44422 215000 0.78


outlier.out(df1, out = FALSE)
#   Var1  var2   var3  var4
# 1   NA    NA     NA 0.983
# 2  110    NA 210000    NA
# 3  200 45465     NA 0.983
# 4  456 44422 215000 0.780
# 5   NA    NA     NA    NA
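
Tukey's fences are conventionally placed at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, which is a wider band than the quartiles used in the code above. For readers who want that rule instead, here is a minimal sketch. The helper name tukey_to_na and its k argument are hypothetical and not part of either answer.

```r
# Rebuild the example data from the question (same values as the dput output)
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# tukey_to_na (hypothetical helper): set values outside
# [Q1 - k * IQR, Q3 + k * IQR] to NA; k = 1.5 is the conventional multiplier
tukey_to_na <- function(x, k = 1.5) {
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qq[2] - qq[1]
  lower <- qq[1] - k * iqr
  upper <- qq[2] + k * iqr
  x[!is.na(x) & (x < lower | x > upper)] <- NA
  x
}

# Assigning into df4[] keeps the data frame structure and column names
df4 <- df1
df4[] <- lapply(df1, tukey_to_na)
df4
```

With only five rows per column the quartiles are rough estimates, so on this toy data the sketch illustrates the mechanics rather than a statistically meaningful result.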


Source: https://stackoverflow.com/questions/56629367/identify-outliers-in-a-dataframe-in-r
