I am using a classification tree from sklearn, and when I train the model twice using the same training data and predict with the same test data, I am getting different results.
The answer provided by Matt Krause does not explain the observed behaviour entirely correctly.
The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.
When using the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features should be considered at each split. At each node, the classifier randomly samples max_features features without replacement (!).
Thus, when using max_features=n_features, all features are considered at each split. However, the implementation still samples them at random from the list of features (even though, in this case, this means all features are sampled). The order in which the features are considered is therefore pseudo-random, and if two possible splits are tied, the first one encountered is used as the best split.
This is exactly why your decision tree yields different results each time you train it: the order in which features are considered is randomized at each node, and when two possible splits are tied, the winner depends on which one was considered first.
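To make this concrete, here is a minimal sketch with synthetic data (not from the original question) that forces tied splits by duplicating a feature column; repeated fits without a fixed seed then break the ties differently:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x = rng.rand(100, 1)                  # one informative feature
X = np.hstack([x, x])                 # duplicate it: every candidate split is tied
y = (x[:, 0] > 0.5).astype(int)

# Fit the same tree on the same data several times, without a fixed seed.
trees = [DecisionTreeClassifier().fit(X, y) for _ in range(10)]

# tree_.feature holds the feature index each internal node splits on
# (-2 marks a leaf). With tied splits, this varies between fits.
print({tuple(t.tree_.feature) for t in trees})
```

In this toy example the two columns are identical, so the predictions happen to agree; what changes between fits is which of the tied features each node splits on. With real data, tied splits that partition the samples differently can change the predictions as well, which is what the question observed.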
As mentioned before, the seed used for the randomization can be fixed using the random_state parameter.
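Continuing the sketch above, two fits with the same random_state sample the features in the same pseudo-random order, so tied splits are broken identically and the resulting trees match:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x = rng.rand(100, 1)
X = np.hstack([x, x])                 # duplicated feature, as above
y = (x[:, 0] > 0.5).astype(int)

# Same seed -> same feature order at every node -> identical trees.
clf_a = DecisionTreeClassifier(random_state=42).fit(X, y)
clf_b = DecisionTreeClassifier(random_state=42).fit(X, y)

assert (clf_a.tree_.feature == clf_b.tree_.feature).all()
assert (clf_a.predict(X) == clf_b.predict(X)).all()
```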