How to calculate p-values for pairwise correlation of columns in Pandas?

执念已碎 2021-02-09 11:01

Pandas has a very handy method for pairwise correlation of columns, DataFrame.corr(). That means it is possible to compare correlations between columns of any length. For in

4 Answers
  •  自闭症患者
    2021-02-09 11:35

    Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:

    import pandas as pd
    import numpy as np
    from scipy import stats
    
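    # df is assumed to be your existing DataFrame of numeric columns
    cols = df.columns
    df_corr = pd.DataFrame(index=cols, columns=cols, dtype=float)  # Correlation matrix
    df_p = pd.DataFrame(index=cols, columns=cols, dtype=float)  # Matrix of p-values
    for x in cols:
        for y in cols:
            # pearsonr returns (correlation coefficient, two-sided p-value)
            corr = stats.pearsonr(df[x], df[y])
            df_corr.loc[x, y] = corr[0]
            df_p.loc[x, y] = corr[1]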
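
    One caveat with a loop like this: DataFrame.corr() skips missing values pair by pair, whereas stats.pearsonr does not drop NaNs for you by default. Below is a minimal sketch that masks them per pair, assuming that is the behavior you want. The function name pairwise_pearson and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def pairwise_pearson(df):
    """Return (correlations, p-values) as labeled DataFrames.

    Hypothetical helper: for each pair of columns, only the rows where
    both values are present are passed to scipy.stats.pearsonr.
    """
    cols = df.columns
    corr = pd.DataFrame(np.nan, index=cols, columns=cols)
    pval = pd.DataFrame(np.nan, index=cols, columns=cols)
    for x in cols:
        for y in cols:
            # Rows where both columns have a value
            mask = df[x].notna() & df[y].notna()
            if mask.sum() < 2:
                continue  # pearsonr needs at least two observations
            r, p = stats.pearsonr(df.loc[mask, x], df.loc[mask, y])
            corr.loc[x, y] = r
            pval.loc[x, y] = p
    return corr, pval


# Made-up sample data with a few missing values
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, np.nan],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [5, 3, np.nan, 1, 0, 2],
})
corr, pval = pairwise_pearson(df)
print(corr)
print(pval)
```

    On this sample the coefficients should agree with df.corr(), which makes for a quick sanity check on your own data.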
    

    The result is symmetric, so if you want to take advantage of that and compute only roughly half of the pairs, do:

    mat = df.values.T
    K = len(df.columns)
    correl = np.empty((K,K), dtype=float)
    p_vals = np.empty((K,K), dtype=float)
    
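    for i, ac in enumerate(mat):
        for j, bc in enumerate(mat):
            if i > j:
                continue  # lower triangle is filled in by symmetry below
            corr = stats.pearsonr(ac, bc)
            #corr = stats.kendalltau(ac, bc)
    
            correl[i,j] = corr[0]
            correl[j,i] = corr[0]
            p_vals[i,j] = corr[1]
            p_vals[j,i] = corr[1]
    
    # Keep the original column names as row and column labels
    df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
    df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)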
    #pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
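
    The symmetric version can also be wrapped in a function that takes the test as a parameter, so switching between pearsonr and kendalltau (as in the commented-out line) is a one-argument change. This is only a sketch: the name corr_and_pvalues and the random sample data are made up, and it assumes the columns contain no NaNs and that the function passed in returns a (statistic, p-value) pair.

```python
from itertools import combinations_with_replacement

import numpy as np
import pandas as pd
from scipy import stats


def corr_and_pvalues(df, func=stats.pearsonr):
    """Hypothetical helper: apply func to each unordered pair of columns.

    func must return a (statistic, p-value) pair, as stats.pearsonr,
    stats.spearmanr and stats.kendalltau do.
    """
    cols = list(df.columns)
    k = len(cols)
    mat = df.to_numpy(dtype=float).T  # one row per column of df
    correl = np.empty((k, k), dtype=float)
    p_vals = np.empty((k, k), dtype=float)
    # Visits every (i, j) with i <= j exactly once, diagonal included
    for i, j in combinations_with_replacement(range(k), 2):
        r, p = func(mat[i], mat[j])
        correl[i, j] = correl[j, i] = r
        p_vals[i, j] = p_vals[j, i] = p
    return (pd.DataFrame(correl, index=cols, columns=cols),
            pd.DataFrame(p_vals, index=cols, columns=cols))


# Made-up sample data: 30 rows of random values in three columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=list("abc"))

df_corr, df_p = corr_and_pvalues(df)                   # Pearson
df_tau, df_tau_p = corr_and_pvalues(df, stats.kendalltau)
print(df_corr)
print(df_p)
```

    For K columns this makes K*(K+1)/2 calls instead of K*K, which only matters once the number of columns gets large.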
    
