Question
Is there a good way to get the simple correlation of two grouped DataFrame columns?
It seems that, no matter what, the pandas .corr() functions want to return a correlation matrix. E.g.,
import numpy as np
import pandas as pd
i = pd.MultiIndex.from_product([['A','B','C'], np.arange(1, 11, 1)], names=['Name','Num'])
test = pd.DataFrame(np.random.randn(30, 2), i, columns=['X', 'Y'])
test.groupby(['Name'])[['X','Y']].corr()
returns
               X         Y
Name
A    X  1.000000  0.152663
     Y  0.152663  1.000000
B    X  1.000000 -0.155113
     Y -0.155113  1.000000
C    X  1.000000  0.214197
     Y  0.214197  1.000000
But I am only interested in the off-diagonal term, and it seems kludgy to calculate all four values and then select the one I want, as in
test.groupby(['Name'])[['X','Y']].corr().iloc[0::2]['Y']
to get
A  X    0.152663
B  X   -0.155113
C  X    0.214197
Answer 1:
I would expect something like test.groupby('Name')['X'].corr('Y') to work, but it doesn't, and passing the Series itself (test['Y']) makes it slower. At this point apply seems to be the best option:
test.groupby('Name').apply(lambda df: df['X'].corr(df['Y']))
Out:
Name
A -0.484955
B 0.520701
C 0.120879
dtype: float64
This iterates over the groups and calls Series.corr on each group's DataFrame. The values differ from those in the question because the random data was generated without a fixed seed.
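For anyone who wants reproducible numbers, here is a minimal sketch of the same apply approach with a fixed seed. The seed value and the names rng, per_group and from_matrix are my own choices for illustration, not part of the original post. As a sanity check, it also pulls the X/Y entry out of the full per-group correlation matrix and compares the two results.

```python
import numpy as np
import pandas as pd

# Fixed seed so repeated runs give the same data (the value 0 is arbitrary).
rng = np.random.default_rng(0)

i = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], np.arange(1, 11)], names=['Name', 'Num']
)
test = pd.DataFrame(rng.standard_normal((30, 2)), index=i, columns=['X', 'Y'])

# One Pearson correlation per group, computed directly with Series.corr.
per_group = test.groupby('Name').apply(lambda df: df['X'].corr(df['Y']))

# Cross-check: take the row labelled 'X' from each group's 2x2 correlation
# matrix and read off its 'Y' column (the off-diagonal term).
from_matrix = test.groupby('Name')[['X', 'Y']].corr().xs('X', level=1)['Y']

print(per_group)
print(np.allclose(per_group.to_numpy(), from_matrix.to_numpy()))
```

Both routes should agree up to floating-point error; the apply version just avoids building the full matrix for each group.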
Source: https://stackoverflow.com/questions/48570130/pandas-simple-correlation-of-two-grouped-dataframe-columns