Mean of a correlation matrix - pandas data frame

Anonymous (unverified), submitted 2019-12-03 00:57:01

Question:

I have a large correlation matrix in a pandas DataFrame: df, with shape (342, 342).

How do I take the mean, standard deviation, etc. of all of the numbers in the upper triangle, not including the 1's along the diagonal?

Thank you.

Answer 1:

Another potential one-line answer:

    In [1]: corr
    Out[1]:
              a         b         c         d         e
    a  1.000000  0.022246  0.018614  0.022592  0.008520
    b  0.022246  1.000000  0.033029  0.049714 -0.008243
    c  0.018614  0.033029  1.000000 -0.016244  0.049010
    d  0.022592  0.049714 -0.016244  1.000000 -0.015428
    e  0.008520 -0.008243  0.049010 -0.015428  1.000000

    In [2]: corr.values[np.triu_indices_from(corr.values,1)].mean()
    Out[2]: 0.016381
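For the other statistics the question asks about, the same indexing can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the 3x3 matrix and the name `upper_triangle_values` are made up for illustration. Note that NumPy's `std` defaults to `ddof=0` (population) while pandas' `Series.std` defaults to `ddof=1` (sample), so the sketch sets `ddof` explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical, hand-made symmetric matrix standing in for a real
# correlation matrix (values chosen so the result is easy to check).
corr = pd.DataFrame(
    [[1.0, 0.2, 0.4],
     [0.2, 1.0, 0.6],
     [0.4, 0.6, 1.0]],
    index=list("abc"),
    columns=list("abc"),
)

def upper_triangle_values(df):
    """Return the entries strictly above the diagonal as a 1-D array."""
    arr = df.to_numpy()
    # k=1 starts one step above the main diagonal, so the 1's are skipped.
    return arr[np.triu_indices_from(arr, k=1)]

vals = upper_triangle_values(corr)   # the three off-diagonal values
upper_mean = vals.mean()
upper_std = vals.std(ddof=1)         # sample standard deviation
upper_median = np.median(vals)
```

Because the helper returns a plain NumPy array, any other NumPy reduction (`min`, `max`, `np.percentile`, and so on) can be applied to `vals` in the same way.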

Edit: added performance metrics

Performance of my solution:

    In [3]: %timeit corr.values[np.triu_indices_from(corr.values,1)].mean()
    10000 loops, best of 3: 48.1 us per loop

Performance of Theodros Zelleke's one-line solution:

    In [4]: %timeit corr.unstack().ix[zip(*np.triu_indices_from(corr, 1))].mean()
    1000 loops, best of 3: 823 us per loop

Performance of DSM's solution:

    In [5]: def method1(df):
       ...:     df2 = df.copy()
       ...:     df2.values[np.tril_indices_from(df2)] = np.nan
       ...:     return df2.unstack().mean()
       ...:

    In [5]: %timeit method1(corr)
    1000 loops, best of 3: 242 us per loop


Answer 2:

This is kind of fun. I make no guarantees that this is the real pandas-fu; I'm still at the "numpy + better indexing" stage of learning pandas myself. That said, something like this should get the job done.

First, we make a toy correlation matrix to play with:

    >>> import pandas as pd
    >>> import numpy as np
    >>> frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
    >>> corr = frame.corr()
    >>> corr
              a         b         c         d         e
    a  1.000000  0.022246  0.018614  0.022592  0.008520
    b  0.022246  1.000000  0.033029  0.049714 -0.008243
    c  0.018614  0.033029  1.000000 -0.016244  0.049010
    d  0.022592  0.049714 -0.016244  1.000000 -0.015428
    e  0.008520 -0.008243  0.049010 -0.015428  1.000000

Then we make a copy, and use tril_indices_from to get at the lower indices to mask them:

    >>> c2 = corr.copy()
    >>> c2.values[np.tril_indices_from(c2)] = np.nan
    >>> c2
        a        b         c         d         e
    a NaN  0.06952 -0.021632 -0.028412 -0.029729
    b NaN      NaN -0.022343 -0.063658  0.055247
    c NaN      NaN       NaN -0.013272  0.029102
    d NaN      NaN       NaN       NaN -0.046877
    e NaN      NaN       NaN       NaN       NaN

and now we can do stats on the flattened array:

    >>> c2.unstack().mean()
    -0.0072054178481488901
    >>> c2.unstack().std()
    0.043839624201635466
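The same masking idea can also be written without assigning through `.values`, which may not work in recent pandas versions where that array can be read-only. The following is a rough sketch under that assumption, not the answerer's code: it builds a boolean mask with `np.triu` and passes it to `DataFrame.where`. The seed and the names `mask`, `upper` and `flat` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy correlation matrix, seeded so the run is repeatable.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))
corr = frame.corr()

# Boolean mask that is True strictly above the diagonal (k=1).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() keeps entries where the mask is True and puts NaN elsewhere.
# It returns a new DataFrame, so corr itself is left untouched.
upper = corr.where(mask)

# Flatten to a Series and drop the NaN placeholders.
flat = upper.unstack().dropna()

upper_mean = flat.mean()
upper_std = flat.std()   # pandas default is ddof=1 (sample std)
```

For a 5x5 matrix `flat` holds the 10 values above the diagonal, and `flat.describe()` gives the count, mean, standard deviation and quartiles in one call.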

