Information Gain calculation with Scikit-learn

温柔的废话 2020-12-14 07:20

I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix.

3 Answers
  •  臣服心动
    2020-12-14 07:40

    Here is my proposal for calculating the information gain using pandas:

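    from scipy.stats import entropy
    import pandas as pd

    def information_gain(members, split):
        '''
        Measures the reduction in entropy after the split (in nats, scipy's default)
        :param members: Pandas Series of class labels
        :param split: Pandas Series with the same index, giving the group of each member
        :return: entropy before the split minus the weighted entropy after it
        '''
        entropy_before = entropy(members.value_counts(normalize=True))
        # Rename copies so the caller's Series are not modified
        members = members.rename('members')
        split = split.rename('split')
        # One row per split value, one column per class, holding P(class | split)
        grouped_distrib = members.groupby(split) \
                            .value_counts(normalize=True) \
                            .reset_index(name='count') \
                            .pivot_table(index='split', columns='members', values='count').fillna(0)
        entropy_after = entropy(grouped_distrib.to_numpy(), axis=1)
        # Weight each group's entropy by its share of rows, aligned to the pivot's row order
        weights = split.value_counts(normalize=True).reindex(grouped_distrib.index)
        return entropy_before - (entropy_after * weights.to_numpy()).sum()

    members = pd.Series(['yellow','yellow','green','green','blue'])
    split = pd.Series([0,0,1,1,0])
    print(information_gain(members, split))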
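The function above scores one attribute at a time, while the question is about a sparse document-term matrix. Below is a rough sketch of how the same quantity could be computed for every term at once without densifying the matrix. It rests on assumptions that are not in the question: each term is treated as a binary attribute (present or absent in a document), the result is in nats to match scipy's default, and the names `information_gain_per_term`, `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from scipy import sparse


def information_gain_per_term(X, y):
    """Information gain (nats) of each term's presence/absence w.r.t. the class.

    X: scipy.sparse matrix of shape (n_docs, n_terms), e.g. term counts.
    y: array-like of n_docs class labels.
    """
    X = sparse.csr_matrix(X)
    n_docs = X.shape[0]
    classes, y_idx = np.unique(np.asarray(y), return_inverse=True)

    # Sparse one-hot class indicator matrix, shape (n_docs, n_classes).
    Y = sparse.csr_matrix(
        (np.ones(n_docs), (np.arange(n_docs), y_idx)),
        shape=(n_docs, len(classes)),
    )

    # Binarize: 1.0 where the term occurs in the document (assumption: counts ignored).
    present = (X > 0).astype(np.float64)

    # counts_present[c, t] = number of class-c documents that contain term t.
    counts_present = (Y.T @ present).toarray()
    class_counts = np.asarray(Y.sum(axis=0)).ravel()
    counts_absent = class_counts[:, None] - counts_present

    def entropy_of_counts(counts):
        # Column-wise entropy of a (n_classes, n_cols) count table; 0 * log(0) counts as 0.
        totals = counts.sum(axis=0, keepdims=True)
        p = np.divide(counts, totals,
                      out=np.zeros_like(counts, dtype=float), where=totals > 0)
        logp = np.log(p, out=np.zeros_like(p), where=p > 0)
        return -(p * logp).sum(axis=0)

    h_before = entropy_of_counts(class_counts[:, None].astype(float))[0]
    n_present = counts_present.sum(axis=0)
    n_absent = n_docs - n_present
    # Entropy after the split, weighted by how many documents fall on each side.
    h_after = (n_present * entropy_of_counts(counts_present)
               + n_absent * entropy_of_counts(counts_absent)) / n_docs
    return h_before - h_after


# Toy data: 4 documents, 4 terms. Terms 0 and 1 separate the two classes perfectly.
X_demo = sparse.csr_matrix(np.array([
    [1, 0, 1, 1],
    [2, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 3, 0, 0],
]))
y_demo = ['spam', 'spam', 'ham', 'ham']
print(information_gain_per_term(X_demo, y_demo))
```

Only the small (n_classes, n_terms) count tables are dense here, so memory stays proportional to the vocabulary size rather than to the full matrix. Information gain defined this way is the mutual information between term presence and the class, so as far as I know scikit-learn's `mutual_info_classif` with `discrete_features=True` should give comparable numbers on a binarized matrix.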
    
