calculating the outliers in R

前端 未结 5 1623
慢半拍i
慢半拍i 2020-12-05 01:21

I have a data frame like this:

x

Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD Mean
A     100         50           40        NA           


        
相关标签:
5条回答
  • 2020-12-05 01:54

    I have seen that you've asked some questions on doing things by row. You should avoid that. R follows the concept that columns represent variables and rows represent observations. Many functions are optimized according to this concept. If you need a wide or transposed output to a file you can rearrange your data just before writing to the file.

    I assume that your data actually looks as shown in the question, but that you have more than one row.

    df <- read.table(text="Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD 
    
    Mean
    A     100         50           40        NA         30       60  80
    B     200         40           5         8          NA       NA  NA",check.names = FALSE,header=TRUE)
    
    #needed because one date appears twice
    df <- df[,]
    
    #reshape the data
    library(reshape2)
    df <- melt(df,id="Team")
    names(df)[2] <- "Date"
    
    #remove the SD and Mean
    df <- df[!df$Date %in% c("SD","Mean"),]
    
    #function to detect outliers
    outfun <- function(x) {
      abs(x-mean(x,na.rm=TRUE)) > 3*sd(x,na.rm=TRUE)
    }
    
    #test if function works
    outfun(c(200,rnorm(10)))
    
    #use function over all data
    df3$outlier.all <- outfun(df3$value)
    
    #apply function for each team 
    library(plyr)
    df3 <- ddply(df3,.(Team),transform,outlier.team=outfun(value))
    

    Result:

               Date Team value outlier.all outlier.team
    1    01/01/2012    A   100       FALSE        FALSE
    2    01/02/2012    A    50       FALSE        FALSE
    3    01/03/2012    A    40       FALSE        FALSE
    4  01/01/2012.1    A    NA          NA           NA
    5    01/04/2012    A    30       FALSE        FALSE
    6    01/01/2012    B   200       FALSE        FALSE
    7    01/02/2012    B    40       FALSE        FALSE
    8    01/03/2012    B     5       FALSE        FALSE
    9  01/01/2012.1    B     8       FALSE        FALSE
    10   01/04/2012    B    NA          NA           NA
    
    0 讨论(0)
  • 2020-12-05 02:02

    Get your IQR (Interquartile range) and lower/upper quartile using:

    lowerq = quantile(data)[2]
    upperq = quantile(data)[4]
    iqr = upperq - lowerq #Or use IQR(data)
    

    Compute the bounds for a mild outlier:

    mild.threshold.upper = (iqr * 1.5) + upperq
    mild.threshold.lower = lowerq - (iqr * 1.5)
    

    Any data point outside (> mild.threshold.upper or < mild.threshold.lower) these values is a mild outlier

    To detect extreme outliers do the same, but multiply by 3 instead:

    extreme.threshold.upper = (iqr * 3) + upperq
    extreme.threshold.lower = lowerq - (iqr * 3)
    

    Any data point outside (> extreme.threshold.upper or < extreme.threshold.lower) these values is an extreme outlier

    Hope this helps

    edit: was accessing 50%, not 75%

    0 讨论(0)
  • 2020-12-05 02:06

    I have used @by0's answer above to create a function that automatically removes outliers. Here is the function and some example code:

    # generate 10 random numbers and 2 'outlier' numbers
    testData <- c(-42,rnorm(10),42)
    
    # show the numbers
    testData
    
    # define a function to remove outliers
    FindOutliers <- function(data) {
      lowerq = quantile(data)[2]
      upperq = quantile(data)[4]
      iqr = upperq - lowerq #Or use IQR(data)
      # we identify extreme outliers
      extreme.threshold.upper = (iqr * 3) + upperq
      extreme.threshold.lower = lowerq - (iqr * 3)
      result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
    }
    
    # use the function to identify outliers
    temp <- FindOutliers(testData)
    
    # remove the outliers
    testData <- testData[-temp]
    
    # show the data with the outliers removed
    testData
    
    0 讨论(0)
  • 2020-12-05 02:08

    check out my most sophisticated functions. It has three methods (z mad, iqr), and different processing of outliers (remove, or replace). Plots are available and hacking (trying different methods or thresholds) are possible

    see the example:

    set.seed(1234)
    x = rnorm(10)
    ez.outlier(iris,'Sepal.Length',fill='null',hack=T,cutoff=c(1,2,3),plot=T)
    

    #' univariate outlier cleanup
    #' @description univariate outlier cleanup
    #' @param x a data frame or a vector
    #' @param col colwise processing
    #' \cr        col name
    #' \cr        if x is not a data frame, col is ignored
    #' \cr        could be multiple cols
    #' @param method z score, mad, or IQR (John Tukey)
    #' @param cutoff abs() > cutoff will be treated as outliers. Default/auto values (i.e. if NA):
    #' \cr z 95% of values fall within 1.96, qnorm(0.025,lower.tail=F), or 3
    #' \cr mad 2.5, which is the standard recommendation, or 5.2
    #' \cr iqr 1.5
    #' \cr if multiple values specified, use the first one (an exception is hack=T, during which method and cutoff same length or scalar)
    #' @param hack call mapply to try all method and cutoff (same length or scalar, ie, different methods with 
    #' corresponding cutoff, or same method with different cutoff).
    #' @param plot boxplot and hist before and after outlier processing.
    #' @param fillout how to process outlier, fill with na, mean, median (columnwise for data frame), or 
    #' null --> remove outlier (only for vector or df with single col specified)
    #' @return returns a new data frame or vector. If hack=T, returns nothings
    #' @note univariate outlier approach
    #' The Z-score method relies on the mean and standard deviation of a group of data to measure central
    #' tendency and dispersion. This is troublesome, because the mean and standard deviation are highly
    #' affected by outliers – they are not robust. In fact, the skewing that outliers bring is one of the
    #' biggest reasons for finding and removing outliers from a dataset!
    #' Another drawback of the Z-score method is that it behaves strangely in small datasets – in fact,
    #' the Z-score method will never detect an outlier if the dataset has fewer than 12 items in it.
    #' \cr
    #' \cr
    #' Median absolute deviation, modified z-score. The median and MAD are robust measures of central tendency and dispersion, respectively.
    #' \cr
    #' \cr
    #' Interquartile range method is that, like the modified Z-score method, it uses a robust measure of dispersion.
    #' \cr
    #' @examples
    #' set.seed(1234)
    #' x = rnorm(10)
    #' iris %>% ez.outlier('Sepal.Length',fill='null',hack=T,plot=T)
    #' @export
    ez.outlier = function(x, col=NULL, method=c('z','mad','iqr'), cutoff=NA, fillout=c('na','null','mean','median'), hack=FALSE, plot=FALSE, na.rm=TRUE, print2scr=TRUE) {
        # https://datascienceplus.com/rscript/outlier.R
        # https://cran.r-project.org/web/packages/outliers/index.html
        # https://rpubs.com/hauselin/outliersDetect
    
        if (hack==T){
                # here for programming reason, for mapply,
                # cutoff could not be NULL, use NA, because length(NULL)=0, but length(NA)=1
                mapply(ez.outlier,method=method,cutoff=cutoff,MoreArgs=list(x=x,col=col,hack=F,plot=plot,fillout=fillout,na.rm=na.rm,print2scr=print2scr),SIMPLIFY=F,USE.NAMES=F)
                cat('Hack done! No actual data returned.\n')
                return(invisible(NULL))
        }
    
     method = match.arg(method); fillout =fillout[1]; cutoff=cutoff[1]
    
        if (!is.data.frame(x)) {
            # todropna is a workaround for data frame with single col passed in
    
            x.bak.plot = x; x.replace.na = x; oldNAs = sum(is.na(x.replace.na))
            if (fillout=='na' | fillout=='todropna') {
                replacement = NA
            } else if (fillout=='mean') {
                replacement = mean(x, na.rm=na.rm)
            } else if (fillout=='median') {
                replacement = median(x, na.rm=na.rm)
            } else if (fillout=='null') {
                replacement = NULL
            }
    
            if (method=='z'){
                if(is.na(cutoff)) cutoff = qnorm(0.025,lower.tail=F)
                absz = abs((x - mean(x, na.rm=na.rm))/sd(x, na.rm=na.rm))
                if (!is.null(replacement)) {
                    x[absz > cutoff] <- replacement
                    } else {
                        # if nothing above cutoff, x is untouched
                        if (length(which(absz > cutoff)) > 0) {
                            x = x[-which(absz > cutoff)]
                        }
                    }
                x.replace.na[absz > cutoff] <- NA
            } else if (method=='mad'){
                if(is.na(cutoff)) cutoff = 2.5
                absmad <- abs((x - median(x, na.rm=na.rm))/mad(x, na.rm=na.rm))
                if (!is.null(replacement)) {
                    x[absmad > cutoff] <- replacement
                    } else {
                        if (length(which(absmad > cutoff)) > 0) {
                            x = x[-which(absmad > cutoff)]
                        }
                    }
                x.replace.na[absmad > cutoff] <- NA
            } else if (method=='iqr'){
                # https://stackoverflow.com/a/4788102/2292993
                if(is.na(cutoff)) cutoff = 1.5
                q1 <- quantile(x, 0.25, na.rm=na.rm)
                q3 <- quantile(x, 0.75, na.rm=na.rm)
                # alternatively iqr = q3-q1
                iqr = IQR(x, na.rm = na.rm)
                lower_bound = q1 - (iqr * cutoff)
                upper_bound = q3 + (iqr * cutoff)
                if (!is.null(replacement)) {
                    x[(x > upper_bound) | (x < lower_bound)] <- replacement
                    } else {
                        if (length(which((x > upper_bound) | (x < lower_bound))) > 0) {
                            x = x[-which((x > upper_bound) | (x < lower_bound))]
                        }
                    }
                x.replace.na[(x.replace.na > upper_bound) | (x.replace.na < lower_bound)] <- NA
            }
    
            newNAs = sum(is.na(x.replace.na)) - oldNAs
            if (print2scr) {
                if (!is.null(col)) {
                    cat(sprintf('%-15s %5s(%.2f): %3d outliers found and %s.\n', toString(col), toupper(method), cutoff, newNAs, ifelse((is.null(replacement)|fillout=='todropna'),'REMOVED','REPLACED')))
                } else {
                    cat(sprintf('%5s(%.2f): %3d outliers found and %s.\n', toupper(method), cutoff, newNAs, ifelse((is.null(replacement)|fillout=='todropna'),'REMOVED','REPLACED')))
                }
            }
    
            if (plot){
                # mar controls margin size for individual plot it goes c(bottom, left, top, right)
                # oma is margin for the whole?
                opar = par(mfrow=c(2, 2), oma=c(0,0,1.5,0), mar = c(2,2,1.5,0.5))
                on.exit(par(opar))
                boxplot(x.bak.plot, main=sprintf("With outliers (n=%d)",length(x.bak.plot)))
                hist(x.bak.plot, main=sprintf("With outliers (n=%d)",length(x.bak.plot)), xlab=NULL, ylab=NULL)
    
                boxplot(x, main=sprintf("With outliers (n=%d)",length(x.bak.plot)-newNAs))
                hist(x, main=sprintf("With outliers (n=%d)",length(x.bak.plot)-newNAs), xlab=NULL, ylab=NULL)
                title(sprintf("%s Outlier Check: %s(%.2f)",toString(col), toupper(method), cutoff), outer=TRUE)
            }
        } else if (is.data.frame(x)) {
            if (length(col)>1 & fillout=='null') {
                cat('I do not know how to remove univariate outliers in multiple cols. fillout: null --> na ...\n')
                fillout='na'
            } else if (fillout=='null') {
                fillout='todropna'
            }
            # trick to pass actual col name
            x[col] = lapply(1:length(col), function(j) {ez.outlier(x=x[col][[j]],col=col[j],method=method,cutoff=cutoff,plot=plot,hack=hack,fillout=fillout,na.rm=na.rm,print2scr=print2scr)})
            if (fillout=='todropna') x=x[complete.cases(x[,col,drop=FALSE]),,drop=FALSE]
        } # end if
        return(invisible(x))
    }
    
    0 讨论(0)
  • 2020-12-05 02:10

    The following formulas could be used to determine which values are outliers:

    upper.outlier.calc <- function(x.var, df){
      with(df, quantile(x.var, 0.75) + (1.5 * (quantile(x.var, 0.75) - quantile(x.var, 0.25))))
    }
    
    lower.outlier.calc <- function(x.var, df){
      with(df, quantile(x.var, 0.25) - (1.5 * (quantile(x.var, 0.75) - quantile(x.var, 0.25))))
    }
    
    0 讨论(0)
提交回复
热议问题