numpy corrcoef - compute correlation matrix while ignoring missing data

↘锁芯ラ 提交于 2019-11-30 06:46:00
Jianxun Li

One of the main features of pandas is being NaN friendly. To calculate correlation matrix, simply call df_counties.corr(). Below is an example to demonstrate df.corr() is NaN tolerant whereas np.corrcoef is not.

import pandas as pd
import numpy as np

# data
# ==============================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
df[df < 0] = np.nan
df

         A       B       C       D       E
0   1.7641  0.4002  0.9787  2.2409  1.8676
1      NaN  0.9501     NaN     NaN  0.4106
2   0.1440  1.4543  0.7610  0.1217  0.4439
3   0.3337  1.4941     NaN  0.3131     NaN
4      NaN  0.6536  0.8644     NaN  2.2698
5      NaN  0.0458     NaN  1.5328  1.4694
6   0.1549  0.3782     NaN     NaN     NaN
7   0.1563  1.2303  1.2024     NaN     NaN
8      NaN     NaN     NaN  1.9508     NaN
9      NaN     NaN  0.7775     NaN     NaN
..     ...     ...     ...     ...     ...
90     NaN  0.8202  0.4631  0.2791  0.3389
91  2.0210     NaN     NaN  0.1993     NaN
92     NaN     NaN     NaN  0.1813     NaN
93  2.4125     NaN     NaN     NaN  0.2515
94     NaN     NaN     NaN     NaN  1.7389
95  0.9944  1.3191     NaN  1.1286  0.4960
96  0.7714  1.0294     NaN     NaN  0.8626
97     NaN  1.5133  0.5531     NaN  0.2205
98     NaN     NaN  1.1003  1.2980  2.6962
99     NaN     NaN     NaN     NaN     NaN

[100 rows x 5 columns]

# calculations
# ================================
df.corr()

        A       B       C       D       E
A  1.0000  0.2718  0.2678  0.2822  0.1016
B  0.2718  1.0000 -0.0692  0.1736 -0.1432
C  0.2678 -0.0692  1.0000 -0.3392  0.0012
D  0.2822  0.1736 -0.3392  1.0000  0.1562
E  0.1016 -0.1432  0.0012  0.1562  1.0000


np.corrcoef(df, rowvar=False)

array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])

This will work, using the masked array numpy module:

import numpy as np
import numpy.ma as ma

A = [1, 2, 3, 4, 5, np.NaN]
B = [2, 3, 4, 5.25, np.NaN, 100]

print(ma.corrcoef(ma.masked_invalid(A), ma.masked_invalid(B)))

It outputs:

[[1.0 0.99838143945703]
 [0.99838143945703 1.0]]

Read more here: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html

In case you expect a different number of nans in each array, you may consider taking a logical AND of non-nan masks.

import numpy as np
import numpy.ma as ma

a=ma.masked_invalid(A)
b=ma.masked_invalid(B)

msk = (~a.mask & ~b.mask)

print(ma.corrcoef(a[msk],b[msk]))
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!