Question
I have data in a pandas DataFrame like this:
df =
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570
and I want to run a multiple regression analysis. Here Y is the dependent variable and X1, X2 and X3 are the independent variables. The correlation of each independent variable with the dependent variable is:
df.corr():
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000
As we can see, Y has the highest correlation with X1, so I have selected X1 as the first independent variable. Continuing this process, I want to select as the second independent variable the one with the highest partial correlation with Y. So my question is: how do I compute the partial correlation in this case?
Your help will be highly appreciated.
Answer 1:
Pairwise ranks between Y (last col) and others
If you are only trying to find the correlation rank between Y and the other columns, simply do -
corrs = df.corr().values
# Negate the correlations (not the argsort result) to sort in descending order
ranks = (df.columns[:-1][(-corrs[:-1,-1]).argsort()]).tolist()
Sample run -
In [145]: df
Out[145]:
         X1        X2        X3         Y
0  0.576562  0.481220  0.148405  0.929005
1  0.732278  0.934351  0.115578  0.379051
2  0.078430  0.575374  0.945908  0.999495
3  0.391323  0.429919  0.265165  0.837510
4  0.525265  0.331486  0.951865  0.998278
In [146]: df.corr()
Out[146]:
          X1        X2        X3         Y
X1  1.000000  0.354387 -0.642953 -0.646551
X2  0.354387  1.000000 -0.461510 -0.885174
X3 -0.642953 -0.461510  1.000000  0.649758
Y  -0.646551 -0.885174  0.649758  1.000000
In [147]: corrs = df.corr().values
In [148]: (df.columns[:-1][(-corrs[:-1,-1]).argsort()]).tolist()
Out[148]: ['X3', 'X1', 'X2']
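A pandas-only variant of this ranking, as a minimal sketch. It uses the data from the question and assumes the dependent variable is the column named Y. The second ranking, by absolute value, is an addition of mine: for variable selection the strength of the correlation usually matters more than its sign.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 2, 3, 4, 2],
    "X3": [10, 9, 5, 4, 9],
    "Y": [5.077, 32.330, 65.140, 47.270, 80.570],
})

# Correlation of every predictor with Y, dropping Y's correlation with itself
corr_with_y = df.corr()["Y"].drop("Y")

# Signed ranking: largest positive correlation first
signed_ranks = corr_with_y.sort_values(ascending=False).index.tolist()

# Ranking by strength, ignoring the sign
abs_ranks = corr_with_y.abs().sort_values(ascending=False).index.tolist()

print(signed_ranks)  # ['X1', 'X2', 'X3']
print(abs_ranks)     # ['X1', 'X3', 'X2']
```

Either way X1 comes out first for this data, which matches the choice made in the question.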
Pairwise ranks between all columns
If you want to rank the correlations between all pairs of columns, one approach would be -
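import numpy as np
import pandas as pd

def pairwise_corr_rank(df):
    corrs = df.corr().values
    cols = df.columns
    n = corrs.shape[0]
    # Upper-triangle indices (excluding the diagonal) give each pair once
    r,c = np.triu_indices(n,1)
    idx = corrs[r,c].argsort()
    # Stack the pair names and values, then reverse for descending order
    out = np.c_[cols[r[idx]], cols[c[idx]], corrs[r,c][idx]][::-1]
    return pd.DataFrame(out, columns=['P1','P2','Value'])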
Sample run -
In [109]: df
Out[109]:
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570
In [110]: df.corr()
Out[110]:
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000
In [114]: pairwise_corr_rank(df)
Out[114]:
   P1  P2     Value
0  X1   Y  0.896626
1  X1  X2  0.353553
2  X2   Y  0.204882
3  X3   Y -0.389641
4  X1  X3 -0.409644
5  X2  X3 -0.951747
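Partial correlation with Y
The rankings above use plain correlations, while the question asks about partial correlation. Here is a minimal sketch of one possible way to compute it, not a definitive implementation: regress both the candidate variable and Y on the already selected variables with ordinary least squares, then correlate the two sets of residuals. The helper name partial_corr and its arguments are hypothetical, chosen for illustration. It assumes only linear effects are being removed and that the data from the question are used.

```python
import numpy as np
import pandas as pd

def partial_corr(df, x, y, covars):
    """Correlation between df[x] and df[y] after removing the linear
    effect of the columns in covars from both (hypothetical helper)."""
    # Design matrix: an intercept column plus the control variables
    Z = np.column_stack([np.ones(len(df)),
                         df[list(covars)].to_numpy(dtype=float)])
    resid = {}
    for name in (x, y):
        v = df[name].to_numpy(dtype=float)
        # Least-squares fit of v on Z; keep what Z cannot explain
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        resid[name] = v - Z @ beta
    return np.corrcoef(resid[x], resid[y])[0, 1]

# Data from the question
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 2, 3, 4, 2],
    "X3": [10, 9, 5, 4, 9],
    "Y": [5.077, 32.330, 65.140, 47.270, 80.570],
})

selected = ["X1"]  # already chosen in the question
candidates = [c for c in df.columns if c not in selected + ["Y"]]

# Partial correlation of each remaining variable with Y, controlling for X1
scores = {c: partial_corr(df, c, "Y", selected) for c in candidates}

# Pick the candidate with the largest absolute partial correlation
best = max(scores, key=lambda c: abs(scores[c]))
print(scores, best)
```

With a single control variable this agrees with the usual first-order formula (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). For the next selection step, append the chosen column to selected and repeat. Note that with only five rows these estimates are very noisy.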
Source: https://stackoverflow.com/questions/44843134/partial-correlation-coefficient-in-pandas-dataframe-python