Using pandas, calculate Cramér's coefficient matrix

名媛妹妹 2020-12-23 10:18

I have a dataframe in pandas that contains metrics calculated on Wikipedia articles. There are two categorical variables: nation, which nation the article is ...

4 answers
  •  甜味超标
    2020-12-23 11:05

    Cramér's V seemed quite over-optimistic in a few tests that I ran. Wikipedia recommends a bias-corrected version.

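    import numpy as np
    import scipy.stats as ss

    def cramers_corrected_stat(confusion_matrix):
        """Calculate Cramér's V statistic for categorical-categorical association.

        Uses the bias correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        """
        # Accept either a pandas crosstab or a plain 2-D array of counts
        observed = np.asarray(confusion_matrix)
        chi2 = ss.chi2_contingency(observed)[0]
        n = observed.sum()  # total number of observations (a scalar)
        phi2 = chi2 / n
        r, k = observed.shape
        # Bias-corrected phi^2 and bias-corrected table dimensions
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))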
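
    For reference, the correction computed by the function above can be written out as follows, where n is the total number of observations, r and k are the numbers of rows and columns of the table, and chi^2 is the chi-squared statistic. This is just the code restated as formulas, not an independent derivation:

    ```latex
    \varphi^2 = \frac{\chi^2}{n}, \qquad
    \tilde{\varphi}^2 = \max\left(0,\ \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)

    \tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
    \tilde{k} = k - \frac{(k-1)^2}{n-1}

    \tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
    ```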
    

    Also note that the confusion matrix for two categorical columns can be computed with the built-in pandas function crosstab:

    import pandas as pd
    confusion_matrix = pd.crosstab(df[column1], df[column2])
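
Since the question asks for a full coefficient matrix, here is a minimal sketch of how the two pieces above might be combined into one. The helper name `cramers_v_matrix`, the toy DataFrame, and its column names (`nation`, `lang`, `flag`) are assumptions made up for illustration; they are not from the original question. The diagonal is set to 1.0 by construction rather than computed.

```python
import itertools

import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_corrected_stat(confusion_matrix):
    """Bias-corrected Cramér's V, using the same formula as the answer above."""
    observed = np.asarray(confusion_matrix)
    chi2 = ss.chi2_contingency(observed)[0]
    n = observed.sum()
    phi2 = chi2 / n
    r, k = observed.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


def cramers_v_matrix(df, columns):
    """Return a symmetric DataFrame of pairwise corrected Cramér's V values.

    Hypothetical helper: the diagonal is filled with 1.0 instead of being computed.
    """
    result = pd.DataFrame(np.eye(len(columns)), index=columns, columns=columns)
    for col1, col2 in itertools.combinations(columns, 2):
        table = pd.crosstab(df[col1], df[col2])
        v = cramers_corrected_stat(table)
        result.loc[col1, col2] = v
        result.loc[col2, col1] = v
    return result


# Toy data; the column names and values are invented for this example.
df = pd.DataFrame({
    "nation": ["US", "US", "FR", "FR", "DE", "DE"] * 10,
    "lang":   ["en", "en", "fr", "fr", "de", "de"] * 10,
    "flag":   ["a", "b"] * 30,
})
matrix = cramers_v_matrix(df, ["nation", "lang", "flag"])
print(matrix.round(3))
```

In this toy data, `nation` and `lang` determine each other exactly, so their entry comes out as 1, while `flag` is evenly split within every nation, so its entries against the other two columns come out as 0. On real data, every pair of columns needs a contingency table with no empty rows or columns, since `chi2_contingency` rejects tables whose expected frequencies contain zeros.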
    
