Efficient pairwise correlation for two matrices of features

后端 未结 2 648
天命终不由人
天命终不由人 2020-12-11 04:05

In Python I need to find the pairwise correlation between all features in a matrix A and all features in a matrix B. In particular, I am interestin

相关标签:
2条回答
  • 2020-12-11 04:43

    Adding on to the above answer from personal experience,

    p1 = N*np.dot(B.T,A)
    

    worked way faster for me when compared to

    p1 = N*np.einsum('ij,ik->kj',A,B)
    

    This was especially true when A and B are large dimensional matrices.

    0 讨论(0)
  • 2020-12-11 05:07

    Seems scipy.stats.pearsonr follows this definition of Pearson Correlation Coefficient Formula applied on column-wise pairs from A & B -

    Based on that formula, you can vectorized easily as the pairwise computations of columns from A and B are independent of each other. Here's one vectorized solution using broadcasting -

    # Get number of rows in either A or B
    N = B.shape[0]
    
    # Store columnw-wise in A and B, as they would be used at few places
    sA = A.sum(0)
    sB = B.sum(0)
    
    # Basically there are four parts in the formula. We would compute them one-by-one
    p1 = N*np.einsum('ij,ik->kj',A,B)
    p2 = sA*sB[:,None]
    p3 = N*((B**2).sum(0)) - (sB**2)
    p4 = N*((A**2).sum(0)) - (sA**2)
    
    # Finally compute Pearson Correlation Coefficient as 2D array 
    pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
    
    # Get the element corresponding to absolute argmax along the columns 
    out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
    

    Sample run -

    1) Inputs :

    In [12]: A
    Out[12]: 
    array([[ 10.  ,   8.04,   9.14,   7.46],
           [  8.  ,   6.95,   8.14,   6.77],
           [ 13.  ,   7.58,   8.74,  12.74],
           [  9.  ,   8.81,   8.77,   7.11],
           [ 11.  ,   8.33,   9.26,   7.81]])
    
    In [13]: B
    Out[13]: 
    array([[-14.  ,  -9.96,   8.1 ,   8.84,   8.  ,   7.04],
           [ -6.  ,  -7.24,   6.13,   6.08,   5.  ,   5.25],
           [ -4.  ,  -4.26,   3.1 ,   5.39,   8.  ,   5.56],
           [-12.  , -10.84,   9.13,   8.15,   5.  ,   7.91],
           [ -7.  ,  -4.82,   7.26,   6.42,   8.  ,   6.89]])
    

    2) Original loopy code run -

    In [14]: high_corr_out = np.zeros(A.shape[1])
        ...: for A_col in range(A.shape[1]): 
        ...:     high_corr = 0
        ...:     for B_col in range(B.shape[1]):
        ...:         corr,_ = pearsonr(A[:,A_col], B[:,B_col])
        ...:         high_corr = max_absolute(high_corr, corr)
        ...:     high_corr_out[A_col] = high_corr
        ...:     
    
    In [15]: high_corr_out
    Out[15]: array([ 0.8067843 ,  0.95678152,  0.74016181, -0.85127779])
    

    3) Proposed code run -

    In [16]: N = B.shape[0]
        ...: sA = A.sum(0)
        ...: sB = B.sum(0)
        ...: p1 = N*np.einsum('ij,ik->kj',A,B)
        ...: p2 = sA*sB[:,None]
        ...: p3 = N*((B**2).sum(0)) - (sB**2)
        ...: p4 = N*((A**2).sum(0)) - (sA**2)
        ...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
        ...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
        ...: 
    
    In [17]: pcorr # Pearson Correlation Coefficient array
    Out[17]: 
    array([[ 0.41895565, -0.5910935 , -0.40465987,  0.5818286 ],
           [ 0.66609445, -0.41950457,  0.02450215,  0.64028344],
           [-0.64953314,  0.65669916,  0.30836196, -0.85127779],
           [-0.41917583,  0.59043266,  0.40364532, -0.58144102],
           [ 0.8067843 ,  0.07947386,  0.74016181,  0.53165395],
           [-0.1613146 ,  0.95678152,  0.62107101, -0.4215393 ]])
    
    In [18]: out # elements corresponding to absolute argmax along columns
    Out[18]: array([ 0.8067843 ,  0.95678152,  0.74016181, -0.85127779])
    

    Runtime tests -

    In [36]: A = np.random.rand(4000,40)
    
    In [37]: B = np.random.rand(4000,144)
    
    In [38]: np.allclose(org_app(A,B),proposed_app(A,B))
    Out[38]: True
    
    In [39]: %timeit org_app(A,B) # Original approach
    1 loops, best of 3: 1.35 s per loop
    
    In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach
    10 loops, best of 3: 39.1 ms per loop
    
    0 讨论(0)
提交回复
热议问题