Getting the distribution of values at the leaf node for a DecisionTreeRegressor in scikit-learn

Question

By default, a scikit-learn DecisionTreeRegressor returns the mean of all target values from the training set in a given leaf node.

However, I am interested in getting back the list of target values from my training set that fell into the predicted leaf node. This will allow me to quantify the distribution, and also calculate other metrics like standard deviation.

Is this possible using scikit-learn?

Answer 1:

I think what you're looking for is the apply method of the tree object, which returns the index of the leaf that each sample ends up in. Here's an example:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rs = np.random.RandomState(1234)
x  = rs.randn(10, 2)
y  = rs.randn(10)

md  = rs.randint(1, 5)
dtr = DecisionTreeRegressor(max_depth=md)
dtr.fit(x, y)

# The low-level `tree_` object's methods complain if the input is not `float32`.
# (The estimator-level `dtr.apply(x)` does this conversion for you.)
leaf_ids = dtr.tree_.apply(x.astype(np.float32))

print(leaf_ids)
# => [5 6 6 5 2 6 3 6 6 3]

# A tree of depth `md` has at most 2**md leaves; for small depths
# the two numbers should usually be equal.
print(2**md, np.unique(leaf_ids).shape[0])
# => 4 4
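To get from leaf ids back to the actual target values, one option is to group the training targets by leaf and then look up the leaf that a new sample falls into. The following is a minimal sketch, not part of the original answer. It assumes the default squared-error criterion and no sample weights, and the names `leaf_targets`, `targets_in_predicted_leaf` and `x_new` are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rs = np.random.RandomState(1234)
x = rs.randn(10, 2)
y = rs.randn(10)

dtr = DecisionTreeRegressor(max_depth=2, random_state=0)
dtr.fit(x, y)

# Leaf index of every training sample. The estimator-level `apply`
# validates and converts the input, so no manual float32 cast is needed.
train_leaf_ids = dtr.apply(x)

# Hypothetical lookup table: leaf id -> training targets that landed there.
leaf_targets = {
    leaf: y[train_leaf_ids == leaf] for leaf in np.unique(train_leaf_ids)
}

def targets_in_predicted_leaf(model, leaf_to_targets, samples):
    """For each sample, return the training targets stored in its leaf."""
    return [leaf_to_targets[leaf] for leaf in model.apply(samples)]

# Hypothetical new samples to predict on.
x_new = rs.randn(3, 2)
groups = targets_in_predicted_leaf(dtr, leaf_targets, x_new)

for values, pred in zip(groups, dtr.predict(x_new)):
    # Under the assumptions above, the prediction is the mean of `values`;
    # any other statistic (std, quantiles, ...) can be computed the same way.
    print(len(values), values.mean(), values.std(), pred)
```

Each entry of `groups` is a plain NumPy array, so it can be passed to `np.percentile`, a histogram, or any other summary of the leaf's distribution.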

Source: https://stackoverflow.com/questions/38299015/getting-the-distribution-of-values-at-the-leaf-node-for-a-decisiontreeregressor

Tags

python

machine-learning

scikit-learn

random-forest

decision-tree
