Classification tree in sklearn giving inconsistent answers

忘了有多久 2021-02-09 12:44

I am using a classification tree from sklearn. When I train the model twice on the same data and predict on the same test data, I am getting different results.

4 Answers
刺人心 (OP) 2021-02-09 13:22

    The answer provided by Matt Krause is not entirely correct.

    The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.

    With the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features to consider at each split. At each node, the classifier randomly samples max_features features without replacement (!).

    Thus, when using max_features=n_features, all features are considered at each split. However, the implementation still samples them at random from the list of features (even though, in this case, that means every feature gets sampled). As a result, the order in which features are considered is pseudo-random. If two possible splits are tied, the first one encountered is used as the best split.

    This is exactly why your decision tree yields different results each time you train it: the order in which features are considered is randomized at each node, and when two possible splits are tied, the one actually used depends on which was considered first.
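    Here is a minimal sketch of that behaviour. The dataset from make_classification and all parameter values are illustrative assumptions, not taken from the original question; whether the two trees actually differ depends on whether any split happens to be tied:

        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier, export_text

        # Synthetic data; any fixed dataset works for this demonstration.
        X, y = make_classification(n_samples=200, n_features=10, random_state=0)

        # Train twice on identical data, without fixing random_state.
        tree_a = DecisionTreeClassifier().fit(X, y)
        tree_b = DecisionTreeClassifier().fit(X, y)

        # If any split is tied, the randomized feature order can break the
        # tie differently, so the trees (and their predictions) may differ.
        print(export_text(tree_a) == export_text(tree_b))  # not guaranteed True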

    As has been said before, the seed used for the randomization can be specified using the random_state parameter.
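    For example (same synthetic-data assumptions as the sketch above; the seed value 42 is arbitrary):

        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier, export_text

        X, y = make_classification(n_samples=200, n_features=10, random_state=0)

        # With a fixed random_state, the feature ordering (and hence the
        # tie-breaking) is reproducible, so repeated fits give identical trees.
        tree_a = DecisionTreeClassifier(random_state=42).fit(X, y)
        tree_b = DecisionTreeClassifier(random_state=42).fit(X, y)
        print(export_text(tree_a) == export_text(tree_b))  # True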
