Weighted correlation coefficient with pandas

孤街醉人 提交于 2019-11-29 18:40:01

问题


Is there any way to compute weighted correlation coefficient with pandas? I saw that R has such a method. Also, I'd like to get the p value of the correlation. This I did not find also in R. Link to Wikipedia for explanation about weighted correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Weighted_correlation_coefficient


回答1:


I don't know of any Python packages that implement this, but it should be fairly straightforward to roll your own implementation. Using the naming conventions of the wikipedia article:

def m(x, w):
    """Weighted Mean"""
    return np.sum(x * w) / np.sum(w)

def cov(x, y, w):
    """Weighted Covariance"""
    return np.sum(w * (x - m(x, w)) * (y - m(y, w))) / np.sum(w)

def corr(x, y, w):
    """Weighted Correlation"""
    return cov(x, y, w) / np.sqrt(cov(x, x, w) * cov(y, y, w))

I tried to make the functions above match the formulas in the wikipedia as closely as possible, but there are some potential simplifications and performance improvements. For example, as pointed out by @Alberto Garcia-Raboso, m(x, w) is really just np.average(x, weights=w), so there's no need to actually write a function for it.

The functions are pretty bare-bones, just doing the calculations. You may want to consider forcing inputs to be arrays prior to doing the calculations, i.e. x = np.asarray(x), as these functions will not work if lists are passed. Additional checks to verify all inputs have equal length, non-null values, etc. could also be implemented.

Example usage:

# Initialize a DataFrame.
np.random.seed([3,1415])
n = 10**6
df = pd.DataFrame({
    'x': np.random.choice(3, size=n),
    'y': np.random.choice(4, size=n),
    'w': np.random.random(size=n)
    })

# Compute the correlation.
r = corr(df['x'], df['y'], df['w'])

There's a discussion here regarding the p-value. It doesn't look like there's a generic calculation, and it depends on how you're actually getting the weights.




回答2:


The statsmodels package has an implementation of weighted correlation.



来源:https://stackoverflow.com/questions/38641691/weighted-correlation-coefficient-with-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!