Question
I am using R to run correlations on a very large data matrix with approximate dimensions 10,000 x 15,000 (events x samples). The data set contains floating point values ranging from -15 to 15, as well as NA, NaN, Inf, and -Inf. To simplify the problem I have chosen to work with two rows of my matrix at a time; call them vector1 and vector2. The commands are written below:
CorrelationSpearman = cor(vector1,vector2, method="spearman",use="pairwise.complete.obs")
CorrelationPearson = cor(vector1,vector2,method="pearson",use="pairwise.complete.obs")
For most, but not all, row vectors in my matrix, I get CorrelationPearson=NA. There seems to be no problem with the CorrelationSpearman values. I have checked that the matrix dimensions are correct, and I've run tests on smaller data, which work fine. What are some possible reasons why this occurs?
Answer 1:
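The Pearson correlation coefficient relies on estimating means and (co)variances. use="pairwise.complete.obs" only drops pairs containing NA or NaN, so infinite values stay in the computation. They lead to infinite means and infinite variances, and the arithmetic that follows (for example Inf - Inf) is undefined, so the result is not a number. Spearman and Kendall correlation coefficients are rank-based: an infinite value is simply ranked as the largest or smallest observation, so they handle infinite values just fine (but beware of tied values in your samples!).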
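To see the difference on a small scale, here is a minimal sketch. The toy vectors x and y are made up for illustration and are not the asker's data:

```r
# Hypothetical toy vectors: one Inf and one NA in x
x <- c(1, 2, 3, Inf, NA)
y <- c(2, 1, 4, 5, 6)

# The NA pair is dropped by use=, but the Inf pair is kept
mean(x, na.rm = TRUE)   # infinite, so Pearson's centering step breaks down

pearson  <- cor(x, y, method = "pearson",  use = "pairwise.complete.obs")
spearman <- cor(x, y, method = "spearman", use = "pairwise.complete.obs")

is.na(pearson)       # TRUE: Pearson is not a number here
is.finite(spearman)  # TRUE: Inf is just the top rank
```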
Try:
> lix <- is.infinite(vector1) | is.infinite(vector2)
> cor(vector1[!lix], vector2[!lix], method = "pearson", use = "pairwise.complete.obs")
This just plucks out any pair with infinite values. To do this more generally, a function like this is helpful:
> inf2NA <- function(x) { x[is.infinite(x)] <- NA; x }
> cor(inf2NA(vector1), inf2NA(vector2), ...)
which just converts infinite values to NAs, and then the use argument can handle those NA cases as you see fit.
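The same conversion can be applied to the whole matrix at once rather than pair by pair. This is a rough sketch under assumptions: m is a small made-up stand-in for the events x samples matrix, and row_cor is a name chosen here for illustration.

```r
# Hypothetical small stand-in for the real events x samples matrix
set.seed(1)
m <- matrix(runif(5 * 8, min = -15, max = 15), nrow = 5)
m[1, 2] <- Inf; m[2, 5] <- -Inf; m[3, 1] <- NA; m[4, 7] <- NaN

inf2NA <- function(x) { x[is.infinite(x)] <- NA; x }

# Logical indexing keeps the matrix dimensions intact
m_clean <- inf2NA(m)

# cor() correlates columns, so transpose to correlate rows (events)
row_cor <- cor(t(m_clean), method = "pearson", use = "pairwise.complete.obs")
```

For the full data this produces a 10,000 x 10,000 matrix of doubles, roughly 800 MB, so working on blocks of rows may be more practical.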
Source: https://stackoverflow.com/questions/27047598/r-cor-method-pearson-returns-na-but-method-spearman-returns-value-why