R cor(), method="pearson" returns NA, but method="spearman" returns a value. Why?

Posted by 余生长醉 on 2019-12-01 11:33:26

Question


I am using R to run correlations on a very large data matrix with approximate dimensions 10,000 x 15,000 (events x samples). The data set contains floating-point values between -15 and 15, as well as NA, NaN, Inf, and -Inf. To simplify the problem I work with two rows of the matrix at a time; call them vector1 and vector2. The commands are:

CorrelationSpearman = cor(vector1,vector2, method="spearman",use="pairwise.complete.obs")
CorrelationPearson = cor(vector1,vector2,method="pearson",use="pairwise.complete.obs")

For most, but not all, row vectors in my matrix I get CorrelationPearson=NA. There seems to be no problem with the CorrelationSpearman values. I have checked that the matrix dimensions are correct, and I have run tests on smaller data, which work fine. What are some possible reasons for this?


Answer 1:


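The Pearson correlation coefficient relies on estimating means and (co)variances. A single infinite value makes the mean infinite, the deviation of that value from the mean becomes Inf - Inf, which is NaN, and the NaN propagates through the variance and covariance to the final result. Note that use="pairwise.complete.obs" only drops pairs containing NA or NaN; it does not drop Inf or -Inf. The Spearman and Kendall correlation coefficients are rank-based, and ranking works fine with infinite values, which simply take the highest or lowest rank (but beware of tied values in your samples!).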
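The effect is easy to reproduce on a small scale. The sketch below uses two made-up vectors, x and y (not taken from the question's data), where x contains a single Inf:

```r
# Toy vectors, invented for illustration; x contains one infinite value.
x <- c(1.5, -2.0, 3.2, Inf, 0.4)
y <- c(0.9, -1.1, 2.8, 4.0, 0.1)

# The mean of x is infinite, and Inf - Inf is NaN,
# so the deviation of the infinite entry from the mean is NaN.
mean(x)
Inf - Inf

# "pairwise.complete.obs" only removes NA/NaN pairs, so the Inf stays in.
pearson  <- cor(x, y, method = "pearson",  use = "pairwise.complete.obs")
spearman <- cor(x, y, method = "spearman", use = "pairwise.complete.obs")

is.na(pearson)  # TRUE: the Pearson result is missing
spearman        # finite: Inf simply receives the largest rank
```

Checking with is.na() rather than comparing against NA is deliberate, since is.na() is TRUE for both NA and NaN.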

Try:

> lix <- is.infinite(vector1) | is.infinite(vector2)
> cor(vector1[!lix], vector2[!lix], method = "pearson", use = "pairwise.complete.obs")

This simply drops any pair in which either value is infinite. To do this more generally, a function like this is helpful:

> inf2NA <- function(x) { x[is.infinite(x)] <- NA; x }
> cor(inf2NA(vector1), inf2NA(vector2), ...)

which just converts infinite values to NA, so that your use argument can then handle them like any other missing value, as you see fit.
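Since the question works with rows of a matrix, the same replacement can be applied to the whole matrix in one step. This is only a sketch on a small hypothetical matrix m (the names m, m_clean and row_cor are made up here); on the full data, a row-by-row correlation matrix would have about 10,000 x 10,000 entries, so memory use is worth checking first.

```r
# Hypothetical small matrix standing in for the events x samples data.
m <- rbind(
  event1 = c( 1.0, 2.5, Inf, -3.0,  0.7,   NA),
  event2 = c( 0.8, 2.0, 1.5, -2.5, -Inf,  4.0),
  event3 = c(-1.2, 0.3, 2.2,  NaN,  1.1, -0.5)
)

# Same idea as inf2NA, applied to every cell of the matrix at once.
m_clean <- m
m_clean[is.infinite(m_clean)] <- NA

# cor() correlates columns, so transpose to correlate rows (events).
row_cor <- cor(t(m_clean), method = "pearson", use = "pairwise.complete.obs")
row_cor
```

With the infinite values turned into NA, each pair of rows is correlated over the samples where both rows have a usable value.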



Source: https://stackoverflow.com/questions/27047598/r-cor-method-pearson-returns-na-but-method-spearman-returns-value-why
