Can you get the selected leaf from a DecisionTreeRegressor in scikit-learn

Submitted by 时光怂恿深爱的人放手 on 2019-12-23 06:01:51

Question


I've just been reading this great paper and am trying to implement the following:

... We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1- of-K coding of this type of features. For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree ...
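The 1-of-K coding described in the quote can be sketched in plain Python. The function name and toy numbers below are just illustrative, taken from the paper's own example:

```python
def one_of_k_encode(leaf_per_tree, leaves_per_tree):
    """Build the concatenated 1-of-K vector: for each subtree,
    emit a block of zeros with a single 1 at the selected leaf's position."""
    vec = []
    for leaf, n_leaves in zip(leaf_per_tree, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1
        vec.extend(block)
    return vec

# Instance lands in leaf 2 of subtree 1 (3 leaves) and leaf 1 of subtree 2 (2 leaves).
# The paper counts leaves from 1, so subtract 1 for zero-based list positions.
one_of_k_encode([1, 0], [3, 2])  # -> [0, 1, 0, 1, 0]
```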

Does anyone know how I can predict a bunch of rows and, for each of those rows, get the selected leaf for each tree in the ensemble? For this use case I don't really care what the node represents, just its index. I had a look at the source and couldn't quickly see anything obvious. I can see that I need to iterate over the trees and do something like this:

for sample in X_test:
  for tree in gbc.estimators_:
    leaf = tree.leaf_index(sample) # This is the function I need but don't think exists.
    ...

Any pointers appreciated.


Answer 1:


DecisionTreeRegressor has a tree_ attribute which gives you access to the underlying decision tree. That object has an apply method, which finds the corresponding leaf id:

dt.tree_.apply(X)

Note that apply expects its input to have type float32.
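A minimal sketch of this on synthetic data (the variable names are illustrative, not from the question):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.rand(20)

dt = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# tree_.apply expects float32 input and returns one leaf node id per row
leaf_ids = dt.tree_.apply(X.astype(np.float32))
```

Note that more recent scikit-learn versions also expose an apply method on the estimator itself (`dt.apply(X)`), which handles the dtype conversion for you.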




Answer 2:


The following function goes beyond identifying the selected leaf from the decision tree and implements the approach in the referenced paper: as in the paper, I use the GBC for feature engineering.

import numpy as np

def makeTreeBins(gbc, X):
    '''
    Takes in a GradientBoostingClassifier object (gbc) and a data frame (X).
    Returns a numpy array of dim (rows(X), num_estimators), where each row represents the set of terminal nodes
    that the record X[i] falls into across all estimators in the GBC.

    Note, each tree produces up to 2^max_depth terminal nodes. I append a prefix to the terminal node id in each
    incremental estimator so that I can use these as feature ids in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):

        prefix = (i + 2) * 100  # Must be an integer

        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))

        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))

    return nd_mat
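For reference, newer scikit-learn versions provide an apply method on the ensemble itself, which returns the same per-estimator leaf ids without the manual loop. A sketch on synthetic data (the data and parameters are illustrative), combined with OneHotEncoder to get the paper's 1-of-K features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = (X[:, 0] > 0.5).astype(int)

gbc = GradientBoostingClassifier(n_estimators=5, max_depth=2, random_state=0).fit(X, y)

# apply returns leaf indices with shape (n_samples, n_estimators, n_classes);
# for binary classification the last axis has size 1
leaves = gbc.apply(X)[:, :, 0]

# 1-of-K encode the leaf ids of each estimator, as in the paper
encoded = OneHotEncoder().fit_transform(leaves)
```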


Source: https://stackoverflow.com/questions/27654635/can-you-get-the-selected-leaf-from-a-decisiontreeregressor-in-scikit-learn
