Remove outliers from correlation coefficient calculation

有刺的猬 2021-01-31 22:57

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

5 Answers
  •  青春惊慌失措
    2021-01-31 23:10

    This may already be obvious to the OP, but just to make sure: you have to be careful, because trying to maximize the correlation may actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would remove the outliers first and then calculate the correlation. More generally, we want a correlation estimate that is robust to outliers (and there are many such methods in R).

    Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

    set.seed(1)
    x <- rnorm(1000)
    y <- rnorm(1000)
    > cor(x,y)
    [1] 0.006401211
    

    Now let's add an outlier point (500,500):

    x <- c(x, 500)
    y <- c(y, 500)
    

    Now the correlation of any subset that includes the outlier point will be close to 1, because that single point dominates both the covariance and the variances, while the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,

    > cor(x,y)
    [1] 0.995741
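
    To see the other half of that claim (a subset without the outlier has a correlation near zero), here is a minimal sketch of the "remove outliers first" approach. The helper is_outlier and its cutoff k = 5 are illustrative assumptions of mine, not a standard rule; it flags values that lie more than k robust standard deviations (MAD) from the median.

    ```r
    set.seed(1)
    x <- c(rnorm(1000), 500)
    y <- c(rnorm(1000), 500)

    # Hypothetical helper: TRUE where a value is more than k MADs from the median.
    # median() and mad() are barely affected by a single extreme point.
    is_outlier <- function(v, k = 5) abs(v - median(v)) / mad(v) > k

    # Keep only the points that are not flagged in either coordinate.
    keep <- !(is_outlier(x) | is_outlier(y))

    sum(!keep)             # how many points were dropped
    cor(x[keep], y[keep])  # Pearson correlation on the remaining points
    ```

    The cutoff is a judgment call: a smaller k drops more points, so it is worth checking sum(!keep) before trusting the result.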
    

    If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:

    library(robust)  # fails loudly if the package is not installed
    > covRob(cbind(x,y), corr = TRUE)
    Call:
    covRob(data = cbind(x, y), corr = TRUE)
    
    Robust Estimate of Correlation: 
                x           y
    x  1.00000000 -0.02594260
    y -0.02594260  1.00000000
    

    You can play around with the parameters of covRob to decide how to trim the data.

    UPDATE: There is also rlm() (robust linear regression) in the MASS package.
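
    If installing an extra package is not an option, a rank-based correlation from base R is another way to limit the influence of a single extreme point. This is my own addition rather than part of the original answer, and note that Spearman and Kendall measure monotone association, so they are not estimates of the same quantity as Pearson. A sketch using the same data as above:

    ```r
    set.seed(1)
    x <- c(rnorm(1000), 500)
    y <- c(rnorm(1000), 500)

    # Pearson uses the raw values, so the point (500, 500) dominates.
    pearson <- cor(x, y)

    # Spearman and Kendall use ranks: the outlier just becomes the largest
    # rank in both vectors, so its influence is bounded.
    spearman <- cor(x, y, method = "spearman")
    kendall  <- cor(x, y, method = "kendall")

    round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3)
    ```

    Here the rank-based values should stay near zero while Pearson stays near 1, which matches the picture covRob gives.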
