Fast Information Gain computation

离开以前 2020-12-29 12:09

I need to compute Information Gain scores for >100k features in >10k documents for text classification. Code below works fine but f

3 Answers
  • 2020-12-29 12:57

    I'm not sure this still helps, since a year has passed, but I am now facing the same task for text classification. I rewrote your code using the nonzero() method that sparse matrices provide. I then scan nz, count the class label (y value) of each non-zero entry, and compute the entropy from those counts.

    The following code takes only seconds to run on the news20 dataset (loaded in libsvm sparse matrix format).

    import numpy as np
    
    
    def information_gain(X, y):
    
        def _entropy(counts, n):
            # Entropy in nats; empty classes are skipped since 0 * log(0) := 0.
            ent = 0.0
            for cnt in counts.values():
                if cnt > 0:
                    p = cnt / float(n)
                    ent -= p * np.log(p)
            return ent
    
        def _cal_ig(class_cnt, feature_tot):
            if feature_tot == tot:
                # The feature occurs in every document, so it splits nothing.
                return 0.0
            rest = {c: class_tot_cnt[c] - class_cnt.get(c, 0) for c in class_tot_cnt}
            return entropy_before - (
                (feature_tot / float(tot)) * _entropy(class_cnt, feature_tot)
                + ((tot - feature_tot) / float(tot)) * _entropy(rest, tot - feature_tot))
    
        y = np.asarray(y)
        tot = X.shape[0]
        class_tot_cnt = {}
        for c in y:
            class_tot_cnt[c] = class_tot_cnt.get(c, 0) + 1
        entropy_before = _entropy(class_tot_cnt, tot)
    
        # nz[0] holds the feature index and nz[1] the document index of each
        # non-zero entry. Sort by feature so that every feature's entries are
        # contiguous, whatever the storage format of X.
        nz = X.T.nonzero()
        order = np.argsort(nz[0], kind="stable")
        features, docs = nz[0][order], nz[1][order]
    
        # Features that never occur keep a score of 0.
        scores = np.zeros(X.shape[1])
        class_cnt = {}
        feature_tot = 0
        for i in range(len(features)):
            if i != 0 and features[i] != features[i - 1]:
                # A new feature starts here: score the one just finished.
                scores[features[i - 1]] = _cal_ig(class_cnt, feature_tot)
                class_cnt = {}
                feature_tot = 0
            feature_tot += 1
            yclass = y[docs[i]]
            class_cnt[yclass] = class_cnt.get(yclass, 0) + 1
        if feature_tot > 0:
            scores[features[-1]] = _cal_ig(class_cnt, feature_tot)
    
        return scores
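
The same counting idea can also be sketched without a Python loop over the non-zero entries. The version below is not the answerer's code: the names entropy_rows and information_gain_vectorized are made up for illustration, and it assumes that a feature counts as "set" whenever its value is non-zero. It builds a features-by-classes count table with one sparse matrix product and derives both entropies from that table.

```python
import numpy as np
from scipy.sparse import csr_matrix


def entropy_rows(counts):
    """Row-wise entropy (in nats) of a 2-D array of non-negative counts."""
    totals = counts.sum(axis=1, keepdims=True)
    # Rows without any documents get probability 0 everywhere (entropy 0).
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # 0 * log(0) is treated as 0.
    logp = np.log(p, out=np.zeros_like(p), where=p > 0)
    return -(p * logp).sum(axis=1)


def information_gain_vectorized(X, y):
    """Hypothetical helper: IG of 'feature is non-zero' for every column of X."""
    X = csr_matrix(X)
    classes, y_idx = np.unique(np.asarray(y), return_inverse=True)
    n_docs = X.shape[0]

    # One-hot class indicator, shape (n_docs, n_classes).
    Y = np.zeros((n_docs, len(classes)))
    Y[np.arange(n_docs), y_idx] = 1.0

    # Binary presence matrix, then per-feature class counts.
    present = (X != 0).astype(np.float64)
    set_counts = np.asarray(present.T @ Y)       # (n_features, n_classes)
    class_totals = Y.sum(axis=0)
    not_set_counts = class_totals - set_counts   # documents without the feature

    n_set = set_counts.sum(axis=1)
    h_before = entropy_rows(class_totals[None, :])[0]
    h_after = (n_set * entropy_rows(set_counts)
               + (n_docs - n_set) * entropy_rows(not_set_counts)) / n_docs
    return h_before - h_after
```

On a toy matrix, a feature that separates two balanced classes perfectly should score log(2), while a feature that is spread evenly over both classes, or never occurs, should score 0.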
    
  • 2020-12-29 12:59

    This line takes 90% of the time: feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]. Try replacing it with a set operation.
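
A minimal sketch of that change follows. The question's code is truncated here, so the types are assumptions: feature_range is taken to be range(n_docs) and feature_set_indices a NumPy array of document indices; n_docs and the sample values are made up for illustration.

```python
import numpy as np

n_docs = 10                                # hypothetical number of documents
feature_range = range(n_docs)              # assumed definition
feature_set_indices = np.array([1, 4, 7])  # documents where the feature is set

# Original pattern: every `in` test scans feature_set_indices from the start.
slow = [i for i in feature_range if i not in feature_set_indices]

# Set lookup: membership tests take constant time on average.
set_lookup = set(feature_set_indices.tolist())
fast = [i for i in feature_range if i not in set_lookup]

# Alternative: let NumPy compute the sorted set difference directly.
fast_np = np.setdiff1d(np.arange(n_docs), feature_set_indices)
```

All three variants yield the same indices; only the cost of each membership test differs.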

  • 2020-12-29 13:01

    Here is a version that uses matrix operations. The IG of a feature is the mean of its class-specific scores.

    import numpy as np
    from scipy.sparse import issparse
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.utils import check_array
    from sklearn.utils.extmath import safe_sparse_dot
    
    
    def ig(X, y):
    
        def get_t1(fc, c, f):
            t = np.log2(fc/(c * f))
            t[~np.isfinite(t)] = 0
            return np.multiply(fc, t)
    
        def get_t2(fc, c, f):
            t = np.log2((1-f-c+fc)/((1-c)*(1-f)))
            t[~np.isfinite(t)] = 0
            return np.multiply((1-f-c+fc), t)
    
        def get_t3(c, f, class_count, observed, total):
            nfc = (class_count - observed)/total
            t = np.log2(nfc/(c*(1-f)))
            t[~np.isfinite(t)] = 0
            return np.multiply(nfc, t)
    
        def get_t4(c, f, feature_count, observed, total):
            fnc = (feature_count - observed)/total
            t = np.log2(fnc/((1-c)*f))
            t[~np.isfinite(t)] = 0
            return np.multiply(fnc, t)
    
        X = check_array(X, accept_sparse='csr')
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative.")
    
        Y = LabelBinarizer().fit_transform(y)
        if Y.shape[1] == 1:
            Y = np.append(1 - Y, Y, axis=1)
    
        # counts
    
        observed = safe_sparse_dot(Y.T, X)          # n_classes * n_features
        total = observed.sum(axis=0).reshape(1, -1).sum()
        feature_count = X.sum(axis=0).reshape(1, -1)
        class_count = (X.sum(axis=1).reshape(1, -1) * Y).T
    
        # probs
    
        f = feature_count / feature_count.sum()
        c = class_count / float(class_count.sum())
        fc = observed / total
    
        # the feature score is averaged over classes
        scores = (get_t1(fc, c, f) +
                get_t2(fc, c, f) +
                get_t3(c, f, class_count, observed, total) +
                get_t4(c, f, feature_count, observed, total)).mean(axis=0)
    
        scores = np.asarray(scores).reshape(-1)
    
        return scores, []
    

    On a dataset with 1000 instances and 1000 unique features, this implementation is more than 100 times faster than the one without matrix operations.
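
Read together, the four helper terms appear to form, for each class, the mutual information between "feature present" and "document in class", which is then averaged over classes. This is an interpretation of the code above, not a formula stated by the answerer. It assumes the probabilities are the ones the code estimates from count mass (fc, f and c) and that zero-probability terms are dropped, as the isfinite masking does:

```latex
\mathrm{IG}(f) = \frac{1}{|C|} \sum_{c \in C}
  \sum_{a \in \{f, \bar{f}\}} \sum_{b \in \{c, \bar{c}\}}
  p(a, b) \, \log_2 \frac{p(a, b)}{p(a)\, p(b)}
```

Under this reading, get_t1 covers the cell (f, c), get_t2 the cell (not f, not c), get_t3 the cell (not f, c), and get_t4 the cell (f, not c).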
