partial correlation coefficient in pandas dataframe python

Submitted by 二次信任 on 2019-12-22 07:59:26

Question


I have data in a pandas DataFrame like this:

df = 

    X1  X2  X3  Y
0   1   2   10  5.077
1   2   2   9   32.330
2   3   3   5   65.140
3   4   4   4   47.270
4   5   2   9   80.570

and I want to do a multiple regression analysis. Here Y is the dependent variable and X1, X2 and X3 are the independent variables. The correlation between each pair of variables, including each independent variable with the dependent variable, is:

df.corr():

      X1          X2            X3         Y
X1  1.000000    0.353553    -0.409644   0.896626
X2  0.353553    1.000000    -0.951747   0.204882
X3  -0.409644   -0.951747   1.000000    -0.389641
Y   0.896626    0.204882    -0.389641   1.000000

As we can see, Y has the highest correlation with X1, so I have selected X1 as the first independent variable. Following the same process, I now want to select as the second independent variable the one with the highest partial correlation with Y. So my question is: how do I find the partial correlation in such a case?

Your help will be highly appreciated.


Answer 1:


Pairwise ranks between Y (last col) and others

If you only want to rank the other columns by their correlation with Y, simply do -

corrs = df.corr().values
# Negate the correlations (not the argsort result) to sort in descending order
ranks = (df.columns[:-1][(-corrs[:-1,-1]).argsort()]).tolist()

Sample run -

In [145]: df
Out[145]: 
         X1        X2        X3         Y
0  0.576562  0.481220  0.148405  0.929005
1  0.732278  0.934351  0.115578  0.379051
2  0.078430  0.575374  0.945908  0.999495
3  0.391323  0.429919  0.265165  0.837510
4  0.525265  0.331486  0.951865  0.998278

In [146]: df.corr()
Out[146]: 
          X1        X2        X3         Y
X1  1.000000  0.354387 -0.642953 -0.646551
X2  0.354387  1.000000 -0.461510 -0.885174
X3 -0.642953 -0.461510  1.000000  0.649758
Y  -0.646551 -0.885174  0.649758  1.000000

In [147]: corrs = df.corr().values

In [148]: (df.columns[:-1][(-corrs[:-1,-1]).argsort()]).tolist()
Out[148]: ['X3', 'X1', 'X2']
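The same ranking can also be sketched with label-based pandas operations instead of positional NumPy indexing. This is a minimal sketch that assumes the dependent variable is the column named Y and rebuilds the question's data so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Data from the question, rebuilt here so the sketch is self-contained
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 2, 3, 4, 2],
    'X3': [10, 9, 5, 4, 9],
    'Y': [5.077, 32.330, 65.140, 47.270, 80.570],
})

# Take the Y column of the correlation matrix, drop Y's correlation with
# itself, and sort the remaining values from highest to lowest
corr_with_y = df.corr()['Y'].drop('Y').sort_values(ascending=False)
ranks = corr_with_y.index.tolist()
print(ranks)
```

If the strength of the relationship matters more than its sign, sorting `corr_with_y.abs()` instead would rank by magnitude.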

Pairwise ranks between all columns

To rank the correlations between every pair of columns, one approach is -

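import numpy as np
import pandas as pd

def pairwise_corr_rank(df):
    corrs = df.corr().values
    cols = df.columns
    n = corrs.shape[0]
    # Indices of the strict upper triangle, so each pair appears once
    r,c = np.triu_indices(n,1)
    # Order the pairs by correlation, highest first
    idx = corrs[r,c].argsort()[::-1]
    # Build from a dict so 'Value' stays numeric and the columns are a flat index
    return pd.DataFrame({'P1': cols[r[idx]], 'P2': cols[c[idx]],
                         'Value': corrs[r,c][idx]})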

Sample run -

In [109]: df
Out[109]: 
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570

In [110]: df.corr()
Out[110]: 
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000

In [114]: pairwise_corr_rank(df)
Out[114]: 
   P1  P2     Value
0  X1   Y  0.896626
1  X1  X2  0.353553
2  X2   Y  0.204882
3  X3   Y -0.389641
4  X1  X3 -0.409644
5  X2  X3 -0.951747
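The rankings above use ordinary pairwise correlations, whereas the question asks for the partial correlation of Y with each remaining variable once X1 has been selected. A minimal sketch follows, assuming the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The helper name `partial_corr` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 2, 3, 4, 2],
    'X3': [10, 9, 5, 4, 9],
    'Y': [5.077, 32.330, 65.140, 47.270, 80.570],
})

def partial_corr(df, x, y, z):
    """Partial correlation of columns x and y, controlling for one column z.

    Hypothetical helper: applies the first-order formula to the
    pairwise Pearson correlations returned by df.corr().
    """
    r = df.corr()
    num = r.loc[x, y] - r.loc[x, z] * r.loc[y, z]
    den = np.sqrt((1 - r.loc[x, z] ** 2) * (1 - r.loc[y, z] ** 2))
    return num / den

# Partial correlation of each remaining candidate with Y, given X1
partials = {col: partial_corr(df, col, 'Y', 'X1') for col in ['X2', 'X3']}
print(partials)
```

The candidate with the largest absolute value in `partials` would be the next variable to consider. Controlling for more than one variable needs a different approach, for example correlating the residuals of regressions on all controls, which this sketch does not cover.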


Source: https://stackoverflow.com/questions/44843134/partial-correlation-coefficient-in-pandas-dataframe-python
