How to interpret decision trees' graph results and find most informative features?


Question


I am using scikit-learn with Python 2.7 and have exported a decision tree's structure below, but I am not sure how to interpret the results. At first I thought the features were listed from most informative to least informative (top to bottom), but examining the value fields in the node labels suggests otherwise. How do I identify the top 5 most informative features from this output, or with a few lines of Python?

from sklearn import tree

tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file=treeFileName)     

# Output below
digraph Tree {
node [shape=box] ;
0 [label="avg-length <= 3.5\ngini = 0.0063\nsamples = 250000\nvalue = [249210, 790]"] ;
1 [label="name-entity <= 2.5\ngini = 0.5\nsamples = 678\nvalue = [338, 340]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="first-name=wm <= 0.5\ngini = 0.4537\nsamples = 483\nvalue = [168, 315]"] ;
1 -> 2 ;
3 [label="name-entity <= 1.5\ngini = 0.4016\nsamples = 435\nvalue = [121, 314]"] ;
2 -> 3 ;
4 [label="substring=ee <= 0.5\ngini = 0.4414\nsamples = 73\nvalue = [49, 24]"] ;
3 -> 4 ;
5 [label="substring=oy <= 0.5\ngini = 0.4027\nsamples = 68\nvalue = [49, 19]"] ;
4 -> 5 ;
6 [label="substring=im <= 0.5\ngini = 0.3589\nsamples = 64\nvalue = [49, 15]"] ;
5 -> 6 ;
7 [label="lastLetter-firstName=w <= 0.5\ngini = 0.316\nsamples = 61\nvalue = [49, 12]"] ;
6 -> 7 ;
8 [label="firstLetter-firstName=w <= 0.5\ngini = 0.2815\nsamples = 59\nvalue = [49, 10]"] ;
7 -> 8 ;
9 [label="substring=sa <= 0.5\ngini = 0.2221\nsamples = 55\nvalue = [48, 7]"] ;
... many many more lines below

Answer 1:


  1. In Python you can use DecisionTreeClassifier.feature_importances_, which according to the documentation contains

    The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [R66].

    Simply do a np.argsort on the feature importances and you get a feature ranking (ties are not accounted for).

  2. You can look at the Gini impurity (the gini value in the graphviz node labels) to get a first idea; lower is better. However, be aware that you will need a way to combine impurity values if a feature is used in more than one split. Typically this is done by accumulating the weighted impurity decrease (the 'purity gain') over all splits on a given feature. This is done for you if you use feature_importances_; the sketch below shows roughly what that computation looks like.
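To make that concrete, here is a minimal sketch (assuming the fitted classifierUsed2 from the question; the helper name manual_feature_importances is made up for illustration) that recomputes Gini importances directly from the tree_ arrays, which is roughly what feature_importances_ does internally:

import numpy as np

def manual_feature_importances(clf):
    # Sum the weighted impurity decrease of every split made on each
    # feature, then normalize so the importances sum to 1.
    t = clf.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node, no split here
            continue
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    return importances / importances.sum()

# Should agree (up to rounding) with classifierUsed2.feature_importances_
print(manual_feature_importances(classifierUsed2))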

Edit: I see the problem goes deeper than I thought. The graphviz output is merely a graphical representation of the tree: it shows every split in detail, so it is a representation of the tree, not of the features. Informativeness (or importance) of the features does not really fit into this representation because it accumulates information over multiple nodes of the tree.

The variable classifierUsed2.feature_importances_ contains importance information for every feature. If you get for example [0, 0.2, 0, 0.1, ...] the first feature has an importance of 0, the second feature has an importance of 0.2, the third feature has an importance of 0, the fourth feature an importance of 0.1, and so on.

Let's sort features by their importance (most important first):

import numpy as np

rank = np.argsort(classifierUsed2.feature_importances_)[::-1]

Now rank contains the indices of the features, starting with the most important one: [1, 3, 2, 0, ...]

Want to see the five most important features?

print(rank[:5])

This prints the indices. What index corresponds to what feature? That's something you should know yourself because you supposedly constructed the feature matrix. Since dv.get_feature_names() returns a plain list, wrap it in an array before indexing it with rank:

print(np.asarray(dv.get_feature_names())[rank[:5]])

Or maybe this:

print('\n'.join(dv.get_feature_names()[i] for i in rank[:5]))
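If you also want the importance value printed next to each name, a small loop over the same rank array (still using the classifierUsed2 and dv objects from the question) does it:

names = dv.get_feature_names()
for i in rank[:5]:
    print("%s: %.4f" % (names[i], classifierUsed2.feature_importances_[i]))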



Answer 2:


As kazemakase already pointed out, you can get the most important features using classifier.feature_importances_:

print(sorted(zip(classifierUsed2.feature_importances_, dv.get_feature_names()), reverse=True))

Just as an addendum, I personally prefer the following printing structure (modified from this question/answer):

# Print decision rules:
def print_decision_tree(tree, feature_names):
    # The fitted classifier exposes the raw tree structure via tree.tree_;
    # children_left/children_right are -1 for leaves, and threshold is -2
    # (TREE_UNDEFINED) at leaf nodes.
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    def recurse(left, right, threshold, features, node, indent=""):
        if threshold[node] != -2:
            # Internal node: print the split condition and recurse into both branches
            print(indent + "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node], indent + "   ")
            print(indent + "} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node], indent + "   ")
            print(indent + "}")
        else:
            # Leaf node: print the class counts stored in this leaf
            print(indent + "return " + str(value[node]))

    recurse(left, right, threshold, features, 0)

# Use it like this:
print_decision_tree(classifierUsed2, dv.get_feature_names())
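As a side note, scikit-learn 0.21 and later (which require Python 3) ship a built-in helper that produces a similar text rendering of the rules, so on newer versions the manual function above is optional:

from sklearn.tree import export_text

print(export_text(classifierUsed2, feature_names=dv.get_feature_names()))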


Source: https://stackoverflow.com/questions/34871212/how-to-interpret-decision-trees-graph-results-and-find-most-informative-feature
