Fast Information Gain computation

离开以前 2020-12-29 12:09

I need to compute Information Gain scores for >100k features in >10k documents for text classification. Code below works fine but f

3 Answers
  •  我在风中等你
    2020-12-29 12:57

    I'm not sure whether this still helps, since a year has passed, but I am now facing the same task for text classification. I rewrote your code using the nonzero() method that sparse matrices provide. I then scan nz, count the class labels (y values) of the documents in which each feature occurs, and calculate the entropy from those counts.
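
For reference, the quantity the code computes can be written as follows. The notation is mine, not the answer's: N is the number of documents (`tot` in the code), n_x is the number of documents in which feature x is nonzero (`featureTot`), and the logarithm is natural because the code uses `np.log`. Each feature is treated as binary (present or absent), which is what counting nonzero entries implies.

```latex
H(Y) = -\sum_{c} p(c)\,\ln p(c),
\qquad
IG(x) = H(Y) - \left[ \frac{n_x}{N}\, H(Y \mid x \text{ present})
                    + \frac{N - n_x}{N}\, H(Y \mid x \text{ absent}) \right]
```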

    The following code needs only seconds to run on the news20 dataset (loaded in libsvm sparse matrix format).

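    import numpy as np
    
    # X: sparse matrix of shape (n_documents, n_features); y: one class label per document.
    def information_gain(X, y):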
    
        def _calIg():
            # Class entropy among documents that contain the feature (x set)
            # and among those that do not (x not set).
            entropy_x_set = 0.0
            entropy_x_not_set = 0.0
            notTot = tot - featureTot
            for c in classTotCnt:
                cnt = classCnt.get(c, 0)
                # Skip empty classes: they contribute 0 to the entropy, and
                # skipping them avoids log(0) and division by zero.
                if cnt > 0:
                    probs = cnt / float(featureTot)
                    entropy_x_set = entropy_x_set - probs * np.log(probs)
                if classTotCnt[c] - cnt > 0:
                    probs = (classTotCnt[c] - cnt) / float(notTot)
                    entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
            return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                     + (notTot / float(tot)) * entropy_x_not_set)
    
        tot = X.shape[0]  # number of documents
        # Count the documents per class, then compute the class entropy
        # before splitting on any feature.
        classTotCnt = {}
        entropy_before = 0.0
        for i in y:
            classTotCnt[i] = classTotCnt.get(i, 0) + 1
        for c in classTotCnt:
            probs = classTotCnt[c] / float(tot)
            entropy_before = entropy_before - probs * np.log(probs)
    
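        # In X.T, row indices are features and column indices are documents.
        # The order of the indices returned by nonzero() is not guaranteed to
        # be grouped by feature, so sort explicitly; each feature then forms
        # one contiguous run.
        nz = X.T.nonzero()
        order = np.argsort(nz[0], kind="stable")
        feats, docs = nz[0][order], nz[1][order]
        pre = -1  # feature currently being accumulated (-1: none yet)
        classCnt = {}
        featureTot = 0
        # Features that never occur keep an information gain of 0.
        information_gain = np.zeros(X.shape[1])
        for i in range(len(feats)):
            if feats[i] != pre:
                # A new feature starts: store the gain of the previous one.
                if pre >= 0:
                    information_gain[pre] = _calIg()
                pre = feats[i]
                classCnt = {}
                featureTot = 0
            featureTot = featureTot + 1
            yclass = y[docs[i]]
            classCnt[yclass] = classCnt.get(yclass, 0) + 1
        # Store the gain of the last feature that was seen.
        if pre >= 0:
            information_gain[pre] = _calIg()
        return information_gain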
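
The Python-level loop over nonzero entries can in principle be replaced by a single sparse matrix product. The following is only a sketch of that idea, not the code from the answer above: the names `information_gain_vectorized` and `entropy_rows` are hypothetical, features are assumed to be binary (present or absent), and a dense array of shape (n_features, n_classes) is assumed to fit in memory.

```python
import numpy as np
from scipy import sparse


def entropy_rows(counts):
    """Row-wise entropy (natural log) of a 2-D array of non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows whose total is 0 keep probability 0 everywhere, hence entropy 0.
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # log is only evaluated where probs > 0; other positions stay 0.
    logs = np.log(probs, out=np.zeros_like(probs), where=probs > 0)
    return -(probs * logs).sum(axis=1)


def information_gain_vectorized(X, y):
    """Information gain of each column of sparse X, treated as a binary feature."""
    y = np.asarray(y).ravel()
    n_docs, n_features = X.shape

    # One-hot class indicator matrix of shape (n_docs, n_classes).
    classes, y_idx = np.unique(y, return_inverse=True)
    Y = sparse.csr_matrix(
        (np.ones(n_docs), (np.arange(n_docs), y_idx)),
        shape=(n_docs, len(classes)),
    )

    # Binary presence matrix: 1 wherever the feature is nonzero.
    B = sparse.csr_matrix(X, dtype=np.float64, copy=True)
    B.eliminate_zeros()
    B.data[:] = 1.0

    # present[f, c] = number of documents of class c that contain feature f.
    present = (B.T @ Y).toarray()
    class_tot = np.asarray(Y.sum(axis=0)).ravel()
    absent = class_tot[np.newaxis, :] - present

    n_present = present.sum(axis=1)
    entropy_before = entropy_rows(class_tot[np.newaxis, :])[0]
    conditional = (n_present * entropy_rows(present)
                   + (n_docs - n_present) * entropy_rows(absent)) / n_docs
    return entropy_before - conditional
```

Calling `information_gain_vectorized(X, y)` with the same `X` and `y` as above should return one score per feature, in column order, with 0 for features that never occur.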
    
