Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by cor(x, y).
This may already be obvious to the OP, but just to make sure: be careful, because trying to maximize correlation may actually favor including outliers. (@Gavin touched on this point in his answer/comments.) I would remove outliers first and then calculate the correlation. More generally, we want a correlation estimate that is robust to outliers (and there are many such methods in R).
Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
> cor(x,y)
[1] 0.006401211
Now let's add an outlier point (500,500):
x <- c(x, 500)
y <- c(y, 500)
Now the correlation of any subset that includes the outlier point will be close to 1, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,
> cor(x,y)
[1] 0.995741
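As a rough sketch of the "remove outliers first" idea, the snippet below rebuilds the same data, drops points that lie far from the median, and then recomputes the Pearson correlation. The helper robust_z and the cutoff of 5 are assumptions made for illustration, not part of the original answer. It also computes Spearman's rank correlation, one of the base R options that is much less sensitive to a single extreme point.

```r
# Rebuild the example data: 1000 uncorrelated pairs plus the outlier (500, 500).
set.seed(1)
x <- c(rnorm(1000), 500)
y <- c(rnorm(1000), 500)

# Hypothetical helper: distance from the median in units of the MAD.
# The cutoff of 5 is an arbitrary choice for this sketch.
robust_z <- function(v) abs(v - median(v)) / mad(v)
keep <- robust_z(x) < 5 & robust_z(y) < 5

pearson_all     <- cor(x, y)                        # dominated by the outlier
pearson_trimmed <- cor(x[keep], y[keep])            # outlier removed first
spearman_all    <- cor(x, y, method = "spearman")   # rank-based, outlier kept
```

Comparing pearson_all with the other two values shows how much of the apparent correlation comes from the single added point.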
If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:
require(robust)
> covRob(cbind(x,y), corr = TRUE)
Call:
covRob(data = cbind(x, y), corr = TRUE)
Robust Estimate of Correlation:
            x           y
x  1.00000000 -0.02594260
y -0.02594260  1.00000000
You can play around with parameters of covRob to decide how to trim the data.
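To make the trimming idea concrete without guessing at covRob's tuning arguments, here is a sketch that swaps in a different function, cov.rob from the MASS package. Its quantile.used argument sets how many points the estimator is fitted to; the 90% fraction below is an assumption chosen only for illustration.

```r
library(MASS)  # provides cov.rob

set.seed(1)
x <- c(rnorm(1000), 500)
y <- c(rnorm(1000), 500)
xy <- cbind(x, y)

# Fit the minimum covariance determinant (MCD) estimator to the most
# concentrated 90% of the points; 0.9 is an illustrative choice.
h <- floor(0.9 * nrow(xy))
fit <- cov.rob(xy, cor = TRUE, method = "mcd", quantile.used = h)

# Off-diagonal entry of the robust correlation matrix.
robust_cor <- fit$cor[1, 2]
```

Varying the fraction passed through quantile.used is one way to see how sensitive the estimate is to the amount of trimming.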
UPDATE: There is also rlm (robust linear regression) in the MASS package.
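A minimal sketch of rlm on the same data, compared with ordinary least squares, follows. Note that it estimates a regression slope rather than a correlation. The choice of method = "MM" is an assumption of this sketch: the default Huber M-estimator downweights large residuals but offers little protection against a high-leverage point such as (500, 500), whereas the MM-estimator is designed to resist it.

```r
library(MASS)  # provides rlm

set.seed(1)
x <- c(rnorm(1000), 500)
y <- c(rnorm(1000), 500)

fit_ols <- lm(y ~ x)                  # ordinary least squares
fit_mm  <- rlm(y ~ x, method = "MM")  # MM-estimator, resists leverage points

# Extract the fitted slopes for comparison.
slope_ols <- unname(coef(fit_ols)["x"])
slope_mm  <- unname(coef(fit_mm)["x"])
```

The least-squares slope is pulled toward the outlier, while the MM slope should stay close to what the 1000 uncorrelated points alone would give.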