Pandas simple correlation of two grouped DataFrame columns

Submitted by 孤街浪徒 on 2020-02-05 06:34:30

Question


Is there a good way to get the simple correlation of two grouped DataFrame columns?

No matter what I try, the pandas .corr() functions seem to return a full correlation matrix. E.g.,

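import numpy as np
import pandas as pd
i = pd.MultiIndex.from_product([['A','B','C'], np.arange(1, 11, 1)], names=['Name','Num'])
test = pd.DataFrame(np.random.randn(30, 2), i, columns=['X', 'Y'])
# select both columns with a list; the bare ['X','Y'] form is rejected by recent pandas
test.groupby(['Name'])[['X','Y']].corr()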

returns

               X         Y
Name                      
A    X  1.000000  0.152663
     Y  0.152663  1.000000
B    X  1.000000 -0.155113
     Y -0.155113  1.000000
C    X  1.000000  0.214197
     Y  0.214197  1.000000

But clearly I am only interested in the off-diagonal term. And it seems kludgy to calculate the four values and then try to select the one I want, as in

test.groupby(['Name'])[['X','Y']].corr().iloc[0::2, 1]

to get

A     X    0.152663
B     X   -0.155113
C     X    0.214197

Answer 1:


I would expect something like test.groupby('Name')['X'].corr('Y') to work, but it doesn't, and passing the Series itself (test['Y']) is slower. At this point apply seems to be the best option:

test.groupby('Name').apply(lambda df: df['X'].corr(df['Y']))
Out: 
Name
A   -0.484955
B    0.520701
C    0.120879
dtype: float64

This iterates over the groups and calls Series.corr on each group's DataFrame. The values differ from those in the question because no random seed was set.
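
To check that the different routes agree, here is a minimal sketch with a fixed seed. The seed value and the names rng, per_group, from_matrix and via_series are my own choices for illustration, not part of the original question or answer. It compares the apply approach with the off-diagonal term picked out of the grouped correlation matrix, and with passing the full Y Series to the grouped X column.

```python
import numpy as np
import pandas as pd

# Fixed seed (an arbitrary choice) so the numbers are reproducible between runs.
rng = np.random.default_rng(0)

i = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], np.arange(1, 11)], names=['Name', 'Num']
)
test = pd.DataFrame(rng.standard_normal((30, 2)), index=i, columns=['X', 'Y'])

# One correlation per group: Series.corr is called on each group's X and Y.
per_group = test.groupby('Name')[['X', 'Y']].apply(
    lambda df: df['X'].corr(df['Y'])
)

# The same values taken from the full per-group correlation matrix:
# keep the 'X' rows (second index level) and read the 'Y' column.
from_matrix = test.groupby('Name')[['X', 'Y']].corr().xs('X', level=1)['Y']

# Passing the whole Y Series; each group's X is aligned with it on the index.
via_series = test.groupby('Name')['X'].corr(test['Y'])

print(per_group)
print(np.allclose(per_group.to_numpy(), from_matrix.to_numpy()))
print(np.allclose(per_group.to_numpy(), via_series.to_numpy()))
```

Selecting the 'X' rows by label with xs avoids relying on the row order of the matrix, which the positional 0::2 slice in the question depends on.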



Source: https://stackoverflow.com/questions/48570130/pandas-simple-correlation-of-two-grouped-dataframe-columns
