Correlation coefficients and p values for all pairs of rows of a matrix

I have a matrix data with m rows and n columns. I used to compute the correlation coefficients between all pairs of rows using np.corrcoef:

import numpy as np
data = np.array([[0, 1, -1], [0, -1, 1]])
np.corrcoef(data)

Now I would also like to have a look at the p-values of these coefficients. np.corrcoef doesn't provide these; scipy.stats.pearsonr does. However, scipy.stats.pearsonr does not accept a matrix on input.

Is there a quick way how to compute both the coefficient and the p-value for all pairs of rows (arriving e.g. at two m by m matrices, one with correlation coefficients, the other with corresponding p-values) without having to manually go through all pairs?

I have encountered the same problem today.

After half an hour of googling, I can't find any code in numpy/scipy library can help me do this.

So I wrote my own version of corrcoef

import numpy as np
from scipy.stats import pearsonr, betai

def corrcoef(matrix):
    r = np.corrcoef(matrix)
    rf = r[np.triu_indices(r.shape[0], 1)]
    df = matrix.shape[1] - 2
    ts = rf * rf * (df / (1 - rf * rf))
    pf = betai(0.5 * df, 0.5, df / (df + ts))
    p = np.zeros(shape=r.shape)
    p[np.triu_indices(p.shape[0], 1)] = pf
    p[np.tril_indices(p.shape[0], -1)] = pf
    p[np.diag_indices(p.shape[0])] = np.ones(p.shape[0])
    return r, p

def corrcoef_loop(matrix):
    rows, cols = matrix.shape[0], matrix.shape[1]
    r = np.ones(shape=(rows, rows))
    p = np.ones(shape=(rows, rows))
    for i in range(rows):
        for j in range(i+1, rows):
            r_, p_ = pearsonr(matrix[i], matrix[j])
            r[i, j] = r[j, i] = r_
            p[i, j] = p[j, i] = p_
    return r, p

The first version use the result of np.corrcoef, and then calculate p-value based on triangle-upper values of corrcoef matrix.

The second loop version just iterating over rows, do pearsonr manually.

def test_corrcoef():
    a = np.array([
        [1, 2, 3, 4],
        [1, 3, 1, 4],
        [8, 3, 8, 5]])

    r1, p1 = corrcoef(a)
    r2, p2 = corrcoef_loop(a)

    assert np.allclose(r1, r2)
    assert np.allclose(p1, p2)

The test passed, they are the same.

def test_timing():
    import time
    a = np.random.randn(100, 2500)

    def timing(func, *args, **kwargs):
        t0 = time.time()
        loops = 10
        for _ in range(loops):
            func(*args, **kwargs)
        print('{} takes {} seconds loops={}'.format(
            func.__name__, time.time() - t0, loops))

    timing(corrcoef, a)
    timing(corrcoef_loop, a)


if __name__ == '__main__':
    test_corrcoef()
    test_timing()

The performance on my Macbook against 100x2500 matrix

corrcoef takes 0.06608104705810547 seconds loops=10

corrcoef_loop takes 7.585600137710571 seconds loops=10

The most consice way of doing it might be the buildin method .corr in pandas, to get r:

In [79]:

import pandas as pd
m=np.random.random((6,6))
df=pd.DataFrame(m)
print df.corr()
          0         1         2         3         4         5
0  1.000000 -0.282780  0.455210 -0.377936 -0.850840  0.190545
1 -0.282780  1.000000 -0.747979 -0.461637  0.270770  0.008815
2  0.455210 -0.747979  1.000000 -0.137078 -0.683991  0.557390
3 -0.377936 -0.461637 -0.137078  1.000000  0.511070 -0.801614
4 -0.850840  0.270770 -0.683991  0.511070  1.000000 -0.499247
5  0.190545  0.008815  0.557390 -0.801614 -0.499247  1.000000

To get p values using t-test:

In [84]:

n=6
r=df.corr()
t=r*np.sqrt((n-2)/(1-r*r))

import scipy.stats as ss
ss.t.cdf(t, n-2)
Out[84]:
array([[ 1.        ,  0.2935682 ,  0.817826  ,  0.23004382,  0.01585695,
         0.64117917],
       [ 0.2935682 ,  1.        ,  0.04363408,  0.17836685,  0.69811422,
         0.50661121],
       [ 0.817826  ,  0.04363408,  1.        ,  0.39783538,  0.06700715,
         0.8747497 ],
       [ 0.23004382,  0.17836685,  0.39783538,  1.        ,  0.84993082,
         0.02756579],
       [ 0.01585695,  0.69811422,  0.06700715,  0.84993082,  1.        ,
         0.15667393],
       [ 0.64117917,  0.50661121,  0.8747497 ,  0.02756579,  0.15667393,
         1.        ]])
In [85]:

ss.pearsonr(m[:,0], m[:,1])
Out[85]:
(-0.28277983892175751, 0.58713640696703184)
In [86]:
#be careful about the difference of 1-tail test and 2-tail test:
0.58713640696703184/2
Out[86]:
0.2935682034835159 #the value in ss.t.cdf(t, n-2) [0,1] cell

Also you can just use the scipy.stats.pearsonr you mentioned in OP:

In [95]:
#returns a list of tuples of (r, p, index1, index2)
import itertools
[ss.pearsonr(m[:,i],m[:,j])+(i, j) for i, j in itertools.product(range(n), range(n))]
Out[95]:
[(1.0, 0.0, 0, 0),
 (-0.28277983892175751, 0.58713640696703184, 0, 1),
 (0.45521036266021014, 0.36434799921123057, 0, 2),
 (-0.3779357902414715, 0.46008763115463419, 0, 3),
 (-0.85083961671703368, 0.031713908656676448, 0, 4),
 (0.19054495489542525, 0.71764166168348287, 0, 5),
 (-0.28277983892175751, 0.58713640696703184, 1, 0),
 (1.0, 0.0, 1, 1),
#etc, etc

Sort of hackish and possibly inefficient, but I think this could be what you're looking for:

import scipy.spatial.distance as dist

import scipy.stats as ss

# Pearson's correlation coefficients
print dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[0]))    

# p-values
print dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[1]))

Scipy's pdist is a very helpful function, which is primarily meant for finding Pairwise distances between observations in n-dimensional space.

But it allows user defined callable 'distance metrics', which can be exploited to carry out any kind of pair-wise operation. The result is returned in a condensed distance matrix form, which can be easily changed to the square matrix form using Scipy's 'squareform' function.

If you do not have to use pearson correlation coefficient, you can use the spearman correlation coefficient, as it returns both the correlation matrix and p-values (note that the former requires that your data is normally distributed, whereas the spearman correlation is a non-parametric measure, thus not assuming the normal distribution of your data). An example code:

from scipy import stats
import numpy as np

data = np.array([[0, 1, -1], [0, -1, 1], [0, 1, -1]])
print 'np.corrcoef:', np.corrcoef(data)
cor, pval = stats.spearmanr(data.T)
print 'stats.spearmanr - cor:\n', cor
print 'stats.spearmanr - pval\n', pval

this is exactly the same performance as the corrcoef in MATLAB:

to have this function work, you will need to install pandas as well as scipy.

# Compute correlation correfficients matrix and p-value matrix
# Similar function as corrcoef in MATLAB
# dframe: pandas dataframe
def corrcoef(dframe):

    fmatrix = dframe.values
    rows, cols = fmatrix.shape

    r = np.ones((cols, cols), dtype=float)
    p = np.ones((cols, cols), dtype=float)

    for i in range(cols):
        for j in range(cols):
            if i == j:
                r_, p_ = 1., 1.
            else:
                r_, p_ = pearsonr(fmatrix[:,i], fmatrix[:,j])

            r[j][i] = r_
            p[j][i] = p_

    return r, p

来源：https://stackoverflow.com/questions/24432101/correlation-coefficients-and-p-values-for-all-pairs-of-rows-of-a-matrix

标签

python

numpy

statistics

scipy

correlation