I\'m trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can\'
Based on Gilles Louppe answer I have written a function. I don't know if it is effective, but it works. Best regards.
def proximityMatrix(model, X, normalize=True):
terminals = model.apply(X)
nTrees = terminals.shape[1]
a = terminals[:,0]
proxMat = 1*np.equal.outer(a, a)
for i in range(1, nTrees):
a = terminals[:,i]
proxMat += 1*np.equal.outer(a, a)
if normalize:
proxMat = proxMat / nTrees
return proxMat
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
train = load_breast_cancer()
model = RandomForestClassifier(n_estimators=500, max_features=2, min_samples_leaf=40)
model.fit(train.data, train.target)
proximityMatrix(model, train.data, normalize=True)
## array([[ 1. , 0.414, 0.77 , ..., 0.146, 0.79 , 0.002],
## [ 0.414, 1. , 0.362, ..., 0.334, 0.296, 0.008],
## [ 0.77 , 0.362, 1. , ..., 0.218, 0.856, 0. ],
## ...,
## [ 0.146, 0.334, 0.218, ..., 1. , 0.21 , 0.028],
## [ 0.79 , 0.296, 0.856, ..., 0.21 , 1. , 0. ],
## [ 0.002, 0.008, 0. , ..., 0.028, 0. , 1. ]])