Pruning Decision Trees

梦毁少年i 2020-12-13 07:14

Below is a snippet of the decision tree, as the full tree is pretty huge.

How do I make the tree stop growing when the lowest value in a node is under 5?

4 Answers
  • 2020-12-13 07:46

    Interestingly, min_impurity_decrease doesn't look as if it would allow growth of any of the nodes you have shown in the snippet you provided (the sum of impurities after splitting equals the pre-split impurity, so there is no impurity decrease). However, while it won't give you exactly the result you want (terminate node if lowest value is under 5), it may give you something similar.

    If my testing is right, the official docs make it look more complicated than it actually is. Just take the lower value from the potential parent node, then subtract the sum of the lower values of the proposed new nodes - this is the gross impurity reduction. Then divide by the total number of samples in the whole tree - this gives you the fractional impurity decrease achieved if the node is split.

    If you have 1000 samples, and a node with a lower value of 5 (i.e. 5 "impurities"), 5/1000 represents the maximum impurity decrease you could achieve if this node was perfectly split. So setting a min_impurity_decrease of 0.005 would approximate stopping the leaf with <5 impurities. It would actually stop most leaves with a bit more than 5 impurities (depending upon the impurities resulting from the proposed split), so it is only an approximation, but as best I can tell it's the closest you can get without post-pruning. A small sketch of this idea is below.
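
    A minimal sketch of that approximation (the dataset here is just a stand-in; the point is scaling the target count of 5 by the total number of training samples, as described above):

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # stop splits whose impurity decrease is worth less than ~5 samples
    min_count = 5
    dt = DecisionTreeClassifier(
        min_impurity_decrease=min_count / len(X),
        random_state=0,
    )
    dt.fit(X, y)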

  • 2020-12-13 07:47

    In the scikit-learn library, DecisionTreeClassifier has a parameter called ccp_alpha. With it you can do cost-complexity post-pruning of decision trees; see the sketch below, and check this out: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
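
    A minimal sketch along the lines of that example (the dataset and the choice of alpha are placeholders; larger ccp_alpha values prune more aggressively):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # effective alphas of the cost-complexity pruning path on the training data
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # refit with one of the computed alphas; the largest alpha collapses the tree to a single node
    dt = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
    dt.fit(X_train, y_train)
    print(dt.tree_.node_count)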

  • 2020-12-13 07:51

    Edit: This is not correct, as @SBylemans and @Viktor point out in the comments. I'm not deleting the answer since someone else may also think this is the solution.

    Set min_samples_leaf to 5.

    min_samples_leaf:

    The minimum number of samples required to be at a leaf node.

    Update: I think it cannot be done with min_impurity_decrease either. Consider the following scenario:

          11/9
       /         \
      6/4       5/5
     /   \     /   \
    6/0  0/4  2/2  3/3
    

    According to your rule, you do not want to split the 6/4 node, since 4 is less than 5, but you do want to split the 5/5 node. However, splitting the 6/4 node gives an impurity decrease of 0.48, while splitting the 5/5 node gives a decrease of 0 (see the quick check below).
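
    A quick check of those numbers using the Gini impurity (the class counts come from the diagram above):

    def gini(counts):
        # Gini impurity of a node with the given class counts
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def impurity_decrease(parent, left, right):
        # impurity decrease when `parent` is split into `left` and `right`
        n = sum(parent)
        return gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)

    print(round(impurity_decrease((6, 4), (6, 0), (0, 4)), 3))  # 0.48 - the split your rule forbids
    print(round(impurity_decrease((5, 5), (2, 2), (3, 3)), 3))  # 0.0  - the split your rule allows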

  • 2020-12-13 07:59

    Directly restricting the lowest value (the number of occurrences of a particular class) in a leaf cannot be done with min_impurity_decrease or any of the other built-in stopping criteria.

    I think the only way you can accomplish this without changing the source code of scikit-learn is to post-prune your tree. To do that, you can just traverse the tree and remove all children of the nodes with a minimum class count less than 5 (or any other condition you can think of). I will continue your example:

    from sklearn.tree._tree import TREE_LEAF

    def prune_index(inner_tree, index, threshold):
        if inner_tree.value[index].min() < threshold:
            # turn node into a leaf by "unlinking" its children
            inner_tree.children_left[index] = TREE_LEAF
            inner_tree.children_right[index] = TREE_LEAF
        # if there are children, visit them as well
        if inner_tree.children_left[index] != TREE_LEAF:
            prune_index(inner_tree, inner_tree.children_left[index], threshold)
            prune_index(inner_tree, inner_tree.children_right[index], threshold)

    # dt is your already-fitted DecisionTreeClassifier
    print(sum(dt.tree_.children_left < 0))  # number of leaves before pruning
    # start pruning from the root
    prune_index(dt.tree_, 0, 5)
    print(sum(dt.tree_.children_left < 0))  # number of leaves after pruning
    

    This code will first print 74 and then 91. That means the pruning has created 17 new leaf nodes (by effectively removing the links to their descendants). Comparing the rendered tree before and after pruning shows that it has indeed shrunk a lot.
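
    If you want to reproduce that kind of before/after comparison, one option (a sketch, assuming dt is the fitted classifier from above) is to render the tree with sklearn.tree.plot_tree once before and once after calling prune_index:

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree

    plt.figure(figsize=(20, 10))
    plot_tree(dt, filled=True)  # run once before and once after prune_index(dt.tree_, 0, 5)
    plt.show()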
