How to interpret decision trees' graph results and find most informative features?


Question


I am using scikit-learn with Python 2.7 and have exported a decision tree's structure below, but I am not sure how to interpret the results. At first I thought the features were listed from most informative to least informative (top to bottom), but examining the value fields in the node labels suggests otherwise. How do I identify the top 5 most informative features from this output, or with a few lines of Python?

from sklearn import tree

tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file=treeFileName)     

# Output below
digraph Tree {
node [shape=box] ;
0 [label="avg-length <= 3.5\ngini = 0.0063\nsamples = 250000\nvalue = [249210, 790]"] ;
1 [label="name-entity <= 2.5\ngini = 0.5\nsamples = 678\nvalue = [338, 340]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="first-name=wm <= 0.5\ngini = 0.4537\nsamples = 483\nvalue = [168, 315]"] ;
1 -> 2 ;
3 [label="name-entity <= 1.5\ngini = 0.4016\nsamples = 435\nvalue = [121, 314]"] ;
2 -> 3 ;
4 [label="substring=ee <= 0.5\ngini = 0.4414\nsamples = 73\nvalue = [49, 24]"] ;
3 -> 4 ;
5 [label="substring=oy <= 0.5\ngini = 0.4027\nsamples = 68\nvalue = [49, 19]"] ;
4 -> 5 ;
6 [label="substring=im <= 0.5\ngini = 0.3589\nsamples = 64\nvalue = [49, 15]"] ;
5 -> 6 ;
7 [label="lastLetter-firstName=w <= 0.5\ngini = 0.316\nsamples = 61\nvalue = [49, 12]"] ;
6 -> 7 ;
8 [label="firstLetter-firstName=w <= 0.5\ngini = 0.2815\nsamples = 59\nvalue = [49, 10]"] ;
7 -> 8 ;
9 [label="substring=sa <= 0.5\ngini = 0.2221\nsamples = 55\nvalue = [48, 7]"] ;
... many many more lines below

Answer 1:


  1. In Python you can use DecisionTreeClassifier.feature_importances_, which according to the documentation contains

    The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [R66].

    Simply do a np.argsort on the feature importances and you get a feature ranking (ties are not accounted for).

  2. You can look at the Gini impurity (the gini value in the graphviz node labels) to get a first idea; lower is better. However, be aware that you will need a way to combine impurity values if a feature is used in more than one split. Typically this is done by accumulating the weighted impurity decrease (the 'purity gain') over all splits on a given feature. This is done for you if you use feature_importances_; the sketch below shows roughly what that computation looks like.
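To make that concrete, here is a minimal sketch (assuming the fitted classifierUsed2 from the question; the helper name manual_feature_importances is made up for illustration) that recomputes Gini importances directly from the tree_ arrays, which is roughly what feature_importances_ does internally:

import numpy as np

def manual_feature_importances(clf):
    # Sum the weighted impurity decrease of every split made on each
    # feature, then normalize so the importances sum to 1.
    t = clf.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node, no split here
            continue
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    return importances / importances.sum()

# Should agree (up to rounding) with classifierUsed2.feature_importances_
print(manual_feature_importances(classifierUsed2))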

Edit: I see the problem goes deeper than I thought. The graphviz output is merely a graphical representation of the tree: it shows every split in detail, so it is a representation of the tree, not of the features. Informativeness (or importance) of the features does not really fit into this representation because it accumulates information over multiple nodes of the tree.

The variable classifierUsed2.feature_importances_ contains importance information for every feature. If you get for example [0, 0.2, 0, 0.1, ...] the first feature has an importance of 0, the second feature has an importance of 0.2, the third feature has an importance of 0, the fourth feature an importance of 0.1, and so on.

Let's sort features by their importance (most important first):

import numpy as np

rank = np.argsort(classifierUsed2.feature_importances_)[::-1]

Now rank contains the indices of the features, starting with the most important one: [1, 3, 2, 0, ...]

Want to see the five most important features?

print(rank[:5])

This prints the indices. What index corresponds to what feature? That's something you should know yourself because you supposedly constructed the feature matrix. Since dv.get_feature_names() returns a plain list, wrap it in an array before indexing it with rank:

print(np.asarray(dv.get_feature_names())[rank[:5]])

Or maybe this:

print('\n'.join(dv.get_feature_names()[i] for i in rank[:5]))
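If you also want the importance value printed next to each name, a small loop over the same rank array (still using the classifierUsed2 and dv objects from the question) does it:

names = dv.get_feature_names()
for i in rank[:5]:
    print("%s: %.4f" % (names[i], classifierUsed2.feature_importances_[i]))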



Answer 2:


As kazemakase already pointed out, you can get the most important features using classifier.feature_importances_:

print(sorted(zip(classifierUsed2.feature_importances_, dv.get_feature_names()), reverse=True))

Just as an addendum, I personally prefer the following printing structure (modified from this question/answer):

# Print decision rules:
def print_decision_tree(tree, feature_names):
    # The fitted classifier exposes the raw tree structure via tree.tree_;
    # children_left/children_right are -1 for leaves, and threshold is -2
    # (TREE_UNDEFINED) at leaf nodes.
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    def recurse(left, right, threshold, features, node, indent=""):
        if threshold[node] != -2:
            # Internal node: print the split condition and recurse into both branches
            print(indent + "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node], indent + "   ")
            print(indent + "} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node], indent + "   ")
            print(indent + "}")
        else:
            # Leaf node: print the class counts stored in this leaf
            print(indent + "return " + str(value[node]))

    recurse(left, right, threshold, features, 0)

# Use it like this:
print_decision_tree(classifierUsed2, dv.get_feature_names())
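As a side note, scikit-learn 0.21 and later (which require Python 3) ship a built-in helper that produces a similar text rendering of the rules, so on newer versions the manual function above is optional:

from sklearn.tree import export_text

print(export_text(classifierUsed2, feature_names=dv.get_feature_names()))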


Source: https://stackoverflow.com/questions/34871212/how-to-interpret-decision-trees-graph-results-and-find-most-informative-feature
