> I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. Two categorical variables: nation, which nation the article is…
Cramér's V measures the strength of association between two categorical features in one data set, so it fits your case.
To calculate Cramér's V you first need the contingency table (often called a confusion matrix) of the two variables. So the solution steps are:
1. Filter the data for a single metric
2. Calculate the contingency (confusion) matrix
3. Calculate the Cramér's V statistic
Of course, you can run those steps inside the loop nest from your post, but your opening paragraph mentions only the metrics as an outer parameter, so I am not sure you need both loops. Below is code for steps 2 and 3; the filtering in step 1 is simple, and, as mentioned, I am not sure exactly what you need there.
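For completeness, step 1 could look like the sketch below. The column name 'metric' and the values are made-up placeholders; substitute your real column names.

```python
import pandas as pd

# Hypothetical example frame; replace with your own data
df = pd.DataFrame({
    'metric': ['pageviews', 'edits', 'pageviews'],
    'nation': ['DE', 'FR', 'DE'],
    'lang':   ['de', 'fr', 'en'],
})

# Step 1: keep only the rows for one metric
data = df[df['metric'] == 'pageviews']
```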
Step 2. In the code below, data is a pandas.DataFrame filtered however you want in step 1.
import numpy as np

confusions = []
for nation in list_of_nations:
    for language in list_of_languages:
        # Use & (element-wise and), not `and`, to combine boolean Series,
        # and wrap each comparison in parentheses
        cond = (data['nation'] == nation) & (data['lang'] == language)
        confusions.append(cond.sum())
confusion_matrix = np.array(confusions).reshape(len(list_of_nations), len(list_of_languages))
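As an aside, pandas can build the same table of counts in one call with pd.crosstab, which avoids the explicit loops. A minimal sketch, assuming the same 'nation' and 'lang' columns (the example frame here is made up):

```python
import pandas as pd

# Hypothetical stand-in for the filtered frame from step 1
data = pd.DataFrame({
    'nation': ['DE', 'DE', 'FR', 'FR', 'DE'],
    'lang':   ['de', 'en', 'fr', 'fr', 'de'],
})

# Rows are nations, columns are languages, cells are counts
confusion_matrix = pd.crosstab(data['nation'], data['lang']).to_numpy()
# → array([[2, 1, 0],
#          [0, 0, 2]])
```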
Step 3. In the code below, confusion_matrix is the numpy.ndarray obtained in step 2.
import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    # Chi-squared statistic of the contingency table
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    # Total number of observations
    n = confusion_matrix.sum()
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))
result = cramers_stat(confusion_matrix)
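Putting it together on a small synthetic table (the numbers are made up, just to show the call; the function is repeated so the snippet is self-contained):

```python
import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))

# Made-up 2x3 contingency table: rows = nations, columns = languages
confusion_matrix = np.array([[30, 10,  5],
                             [ 5, 15, 40]])
result = cramers_stat(confusion_matrix)  # a value in [0, 1]
```

A value near 0 means the two variables are nearly independent; a value near 1 means one almost determines the other.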
This code was tested on my own data set; I hope it works in your case without changes.