Remove outliers from correlation coefficient calculation


有刺的猬 2021-01-31 22:57

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

5 answers

青春惊慌失措 (original poster)

2021-01-31 23:10
This may already have been obvious to the OP, but just to make sure: you have to be careful, because trying to maximize the correlation may actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would first remove the outliers and then calculate the correlation. More generally, we want a correlation estimate that is robust to outliers (and there are many such methods in R).

Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:
```
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
cor(x, y)
# [1] 0.006401211
```
Now let's add an outlier point (500,500):
```
x <- c(x, 500)
y <- c(y, 500)
```
Now the correlation of any subset that includes the outlier point will be close to 1, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,
```
cor(x, y)
# [1] 0.995741
```
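To make the "remove outliers first" idea concrete, here is a minimal sketch. The helper is_outlier and its cutoff k = 5 are hypothetical choices made for illustration only; they do not come from the question or from any package. The helper flags values that lie far from the median, measured in units of the MAD (median absolute deviation), and the correlation is then computed on the remaining points.

```r
# Rebuild the example data: 1000 uncorrelated points plus the outlier (500, 500)
set.seed(1)
x <- c(rnorm(1000), 500)
y <- c(rnorm(1000), 500)

# Hypothetical helper (an assumption, not a library function):
# flag values more than k robust standard deviations from the median.
# mad() is from base R's stats package; k = 5 is an arbitrary cutoff.
is_outlier <- function(v, k = 5) {
  abs(v - median(v)) / mad(v) > k
}

# Keep only points that are not flagged in either coordinate
keep <- !(is_outlier(x) | is_outlier(y))

r_all     <- cor(x, y)              # dominated by the single outlier
r_trimmed <- cor(x[keep], y[keep])  # correlation after trimming

r_all
r_trimmed
```

A per-coordinate rule like this is crude: it will miss points that are unusual only jointly, which is one reason to prefer a multivariate robust estimator.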
If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:
```
library(robust)
covRob(cbind(x, y), corr = TRUE)
# Call:
# covRob(data = cbind(x, y), corr = TRUE)
#
# Robust Estimate of Correlation:
#             x           y
# x  1.00000000 -0.02594260
# y -0.02594260  1.00000000
```
You can experiment with the parameters of covRob to control how the data are trimmed.

UPDATE: There is also rlm (robust linear regression) in the MASS package.
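As a rough sketch of two further options on the same data: a rank-based correlation from base R (cor with method = "spearman", my own pick as one of the "many such methods"), and rlm from MASS. One assumption worth flagging: rlm's default M-estimator mainly guards against outliers in y and may still be pulled by a high-leverage point such as (500, 500), so the sketch uses method = "MM" instead. The variable names are illustrative.

```r
library(MASS)

# Same data as above: 1000 uncorrelated points plus the outlier (500, 500)
set.seed(1)
x <- c(rnorm(1000), 500)
y <- c(rnorm(1000), 500)

# Rank-based correlation: the outlier only contributes its rank (the largest),
# not its magnitude, so it has little influence.
r_spearman <- cor(x, y, method = "spearman")

# Ordinary least squares for comparison: the slope is driven by the outlier.
slope_lm <- coef(lm(y ~ x))[["x"]]

# Robust regression with an MM-estimator (assumed here to cope better with
# the high-leverage point than the default method = "M").
fit_rlm   <- rlm(y ~ x, method = "MM")
slope_rlm <- coef(fit_rlm)[["x"]]

r_spearman
slope_lm
slope_rlm
```

Note that a regression slope is not a correlation coefficient; rlm is useful here mainly for its final weights (fit_rlm$w), which show how strongly each observation was down-weighted.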