How to replace outliers with the 5th and 95th percentile values in R

野性不改 2020-12-30 11:26

I'd like to replace all values in my relatively large R dataset that lie above the 95th percentile or below the 5th percentile with those percentile values.

4 Answers
  • 2020-12-30 12:05

    You can do it in one line of code using squish():

    d2 <- squish(d, quantile(d, c(.05, .95)))
    



    squish() comes from the scales package; see ?squish and ?discard.

    #--------------------------------
    library(scales)
    
    pr <- .95
    q  <- quantile(d, c(1-pr, pr))
    d2 <- squish(d, q)
    #---------------------------------
    
    # Note: depending on your needs, you may want to round the quantiles, i.e.:
    q <- round(quantile(d, c(1-pr, pr)))
    

    Example:

    d <- 1:20
    d
    # [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    
    
    d2 <- squish(d, round(quantile(d, c(.05, .95))))
    d2
    # [1]  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 19
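
If the scales package is not available, the same clamping can be sketched in base R with pmin() and pmax(). This is only an illustration and not part of the answer above; d2 and d3 are names chosen for the example. It also shows what rounding changes: without round(), the two end values become the fractional quantiles themselves.

```r
d <- 1:20

# 5th and 95th percentiles with quantile()'s default method (type 7)
q <- unname(quantile(d, c(.05, .95)))
q
# [1]  1.95 19.05

# Clamp: raise anything below q[1] to q[1], lower anything above q[2] to q[2]
d2 <- pmin(pmax(d, q[1]), q[2])   # ends become 1.95 and 19.05

# Same thing with rounded bounds, matching the rounded squish() example
d3 <- pmin(pmax(d, round(q[1])), round(q[2]))   # ends become 2 and 19
```

pmin() and pmax() propagate NA by default, so missing values in d would stay missing, although quantile() itself would then need na.rm = TRUE.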
    
  • 2020-12-30 12:10

    There is a better way to solve this problem. A point is not an outlier just because it lies above the 95th percentile or below the 5th percentile. Instead, a value is usually considered an outlier if it lies below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR.
    This website explains it more thoroughly.

    To learn more about outlier treatment, refer here.

    capOutlier <- function(x){
       # Quartiles define the outlier fences; the 5th/95th percentiles are the caps
       qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
       caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
       H <- 1.5 * IQR(x, na.rm = TRUE)
       # Only values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are replaced
       x[x < (qnt[1] - H)] <- caps[1]
       x[x > (qnt[2] + H)] <- caps[2]
       return(x)
    }
    df$colName <- capOutlier(df$colName)

    Repeat the last line for each column in your data frame that needs capping.
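
Rather than repeating the assignment by hand, the function can be applied to every numeric column at once. The following is a minimal sketch under assumptions: the data frame df and its columns a, b and label are made up for illustration, and capOutlier is copied from the answer above so the block runs on its own.

```r
# capOutlier as defined in the answer above
capOutlier <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}

# Hypothetical data: one high outlier in a, one low outlier in b
df <- data.frame(a = c(1:19, 100),
                 b = c(-50, 2:20),
                 label = letters[1:20])

# Find the numeric columns and cap each of them; other columns are untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)
```

Selecting columns with is.numeric keeps character or factor columns such as label out of the call, where quantile() would otherwise fail.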
    
  • 2020-12-30 12:15

    This would do it.

    fun <- function(x){
        quantiles <- quantile( x, c(.05, .95 ) )
        x[ x < quantiles[1] ] <- quantiles[1]
        x[ x > quantiles[2] ] <- quantiles[2]
        x
    }
    fun( yourdata )
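
One caveat: quantile() stops with an error when the input contains NA and na.rm is left at its default of FALSE, so fun() as written fails on data with missing values. A possible NA-tolerant variant is sketched below; the name fun_na and the probs argument are introduced here for illustration and are not part of the answer.

```r
fun_na <- function(x, probs = c(.05, .95)) {
  # Ignore NA when computing the bounds; names = FALSE returns a plain vector
  q <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # which() drops NA comparisons, so missing values are left as they are
  x[which(x < q[1])] <- q[1]
  x[which(x > q[2])] <- q[2]
  x
}

y <- c(NA, 1:20)
res <- fun_na(y)   # first element stays NA, the rest is clamped to [1.95, 19.05]
```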
    
  • 2020-12-30 12:16

    I used this code to get what you need:

    qn = quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
    df = within(df, { value = ifelse(value < qn[1], qn[1], value)
                      value = ifelse(value > qn[2], qn[2], value)})
    

    Here df is your data.frame and value is the column that contains your data.
