Can you get the selected leaf from a DecisionTreeRegressor in scikit-learn

Submitted by 时光怂恿深爱的人放手 on 2019-12-23 06:01:51

Question


I've just been reading this great paper and am trying to implement the following:

... We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1- of-K coding of this type of features. For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree ...
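The 1-of-K coding described in the quote can be sketched in plain Python. The function name and toy numbers below are just illustrative, taken from the paper's own example:

```python
def one_of_k_encode(leaf_per_tree, leaves_per_tree):
    """Build the concatenated 1-of-K vector: for each subtree,
    emit a block of zeros with a single 1 at the selected leaf's position."""
    vec = []
    for leaf, n_leaves in zip(leaf_per_tree, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1
        vec.extend(block)
    return vec

# Instance lands in leaf 2 of subtree 1 (3 leaves) and leaf 1 of subtree 2 (2 leaves).
# The paper counts leaves from 1, so subtract 1 for zero-based list positions.
one_of_k_encode([1, 0], [3, 2])  # -> [0, 1, 0, 1, 0]
```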

Does anyone know how I can predict a bunch of rows and, for each of those rows, get the selected leaf for each tree in the ensemble? For this use case I don't really care what the node represents, just its index. I had a look at the source and couldn't quickly see anything obvious. I can see that I need to iterate over the trees and do something like this:

for sample in X_test:
  for tree in gbc.estimators_:
    leaf = tree.leaf_index(sample) # This is the function I need but don't think exists.
    ...

Any pointers appreciated.


Answer 1:


DecisionTreeRegressor has a tree_ attribute which gives you access to the underlying decision tree. That object has an apply method, which finds the corresponding leaf id:

dt.tree_.apply(X)

Note that apply expects its input to have type float32.
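A minimal sketch of this on synthetic data (the variable names are illustrative, not from the question):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.rand(20)

dt = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# tree_.apply expects float32 input and returns one leaf node id per row
leaf_ids = dt.tree_.apply(X.astype(np.float32))
```

Note that more recent scikit-learn versions also expose an apply method on the estimator itself (`dt.apply(X)`), which handles the dtype conversion for you.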




Answer 2:


The following function goes beyond identifying the selected leaf from the decision tree and implements the approach in the referenced paper: as in the paper, I use the GBC for feature engineering.

import numpy as np

def makeTreeBins(gbc, X):
    '''
    Takes in a GradientBoostingClassifier object (gbc) and a data frame (X).
    Returns a numpy array of dim (rows(X), num_estimators), where each row represents the set of terminal nodes
    that the record X[i] falls into across all estimators in the GBC.

    Note, each tree produces up to 2^max_depth terminal nodes. I append a prefix to the terminal node id in each
    incremental estimator so that I can use these as feature ids in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):

        prefix = (i + 2) * 100  # Must be an integer

        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))

        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))

    return nd_mat
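For reference, newer scikit-learn versions provide an apply method on the ensemble itself, which returns the same per-estimator leaf ids without the manual loop. A sketch on synthetic data (the data and parameters are illustrative), combined with OneHotEncoder to get the paper's 1-of-K features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = (X[:, 0] > 0.5).astype(int)

gbc = GradientBoostingClassifier(n_estimators=5, max_depth=2, random_state=0).fit(X, y)

# apply returns leaf indices with shape (n_samples, n_estimators, n_classes);
# for binary classification the last axis has size 1
leaves = gbc.apply(X)[:, :, 0]

# 1-of-K encode the leaf ids of each estimator, as in the paper
encoded = OneHotEncoder().fit_transform(leaves)
```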


Source: https://stackoverflow.com/questions/27654635/can-you-get-the-selected-leaf-from-a-decisiontreeregressor-in-scikit-learn
