Question
By default, a scikit-learn DecisionTreeRegressor returns the mean of all target values from the training set in a given leaf node.
However, I am interested in getting back the list of target values from my training set that fell into the predicted leaf node. This will allow me to quantify the distribution, and also calculate other metrics like standard deviation.
Is this possible using scikit-learn?
Answer 1:
I think what you're looking for is the apply method of the tree_ object (see the scikit-learn source for details). Here's an example:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rs = np.random.RandomState(1234)
x = rs.randn(10, 2)
y = rs.randn(10)
md = rs.randint(1, 5)
dtr = DecisionTreeRegressor(max_depth=md)
dtr.fit(x, y)

# The `tree_` object's methods seem to complain if you don't use `float32`.
leaf_ids = dtr.tree_.apply(x.astype(np.float32))
print(leaf_ids)
# => [5 6 6 5 2 6 3 6 6 3]

# Should probably be equal for small depths.
print(2**md, np.unique(leaf_ids).shape[0])
# => 4 4
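To get from leaf ids to the actual list of training targets per leaf, one option is to group y by leaf id and then look up the leaf a new sample falls into. The following is only a sketch: it assumes a scikit-learn version where the estimator itself exposes apply (so no float32 cast is needed), and the names leaf_targets and x_new are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rs = np.random.RandomState(1234)
x = rs.randn(10, 2)
y = rs.randn(10)

dtr = DecisionTreeRegressor(max_depth=2)
dtr.fit(x, y)

# Leaf index of every training sample, via the estimator's public method.
train_leaf_ids = dtr.apply(x)

# Map each leaf id to the training targets that landed in that leaf.
leaf_targets = {
    leaf: y[train_leaf_ids == leaf] for leaf in np.unique(train_leaf_ids)
}

# Hypothetical new samples: find their leaf, then pull that leaf's targets.
x_new = rs.randn(3, 2)
for leaf, pred in zip(dtr.apply(x_new), dtr.predict(x_new)):
    values = leaf_targets[leaf]
    # With the default criterion, pred should match values.mean().
    print(leaf, pred, values.mean(), values.std(), values)
```

With the per-leaf arrays in hand you can compute the standard deviation, quantiles, or a histogram instead of relying only on the mean that predict returns.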
Source: https://stackoverflow.com/questions/38299015/getting-the-distribution-of-values-at-the-leaf-node-for-a-decisiontreeregressor