Question
When training on even small datasets (<50K rows, <50 columns), using the mean absolute error criterion for sklearn's RandomForestRegressor is nearly 10x slower than using mean squared error. To illustrate, even on a small data set:
import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

def fit_rf_criteria(criterion, X=X, y=y):
    reg = RandomForestRegressor(n_estimators=100,
                                criterion=criterion,
                                n_jobs=-1,
                                random_state=1)
    start = time.time()
    reg.fit(X, y)
    end = time.time()
    print(end - start)

fit_rf_criteria('mse')  # 0.13266682624816895
fit_rf_criteria('mae')  # 1.26043701171875
Why does using the 'mae' criterion take so long for training a RandomForestRegressor? I want to optimize MAE for larger applications, but find the speed of the RandomForestRegressor tuned to this criterion prohibitively slow.
Answer 1:
Thank you @hellpanderr for sharing a reference to the project issue. To summarize: when the random forest regressor optimizes for MSE, it optimizes the L2 norm with a mean-based impurity metric, and candidate splits can be scored cheaply from running sums. When it uses the MAE criterion, it optimizes the L1 norm, which amounts to computing medians. Unfortunately, sklearn's current implementation of the MAE criterion appears to be O(N^2).
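The following is a minimal sketch (not sklearn's actual Cython implementation) of why that gap exists: scoring every split position of one sorted feature under an MSE-style criterion needs only running sums, so the whole scan is O(n), while a naive MAE-style criterion recomputes medians and absolute deviations for each split, giving O(n^2). The function names below are illustrative only.

import numpy as np

def mse_scan(y_sorted):
    """Score all split positions with O(1) updates via running sums."""
    n = len(y_sorted)
    total_sum, total_sq = y_sorted.sum(), (y_sorted ** 2).sum()
    left_sum = left_sq = 0.0
    scores = []
    for i in range(1, n):  # split after index i-1
        left_sum += y_sorted[i - 1]
        left_sq += y_sorted[i - 1] ** 2
        right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
        # sum of squared deviations on each side, derived from sums alone
        left_err = left_sq - left_sum ** 2 / i
        right_err = right_sq - right_sum ** 2 / (n - i)
        scores.append(left_err + right_err)
    return scores  # O(n) for the whole scan

def mae_scan_naive(y_sorted):
    """Score all split positions by recomputing medians each time."""
    n = len(y_sorted)
    scores = []
    for i in range(1, n):
        left, right = y_sorted[:i], y_sorted[i:]
        left_err = np.abs(left - np.median(left)).sum()
        right_err = np.abs(right - np.median(right)).sum()
        scores.append(left_err + right_err)
    return scores  # O(n^2) overall: each of n splits does O(n) work

Running both on the same sorted target vector (e.g. np.sort(y)) shows the same ordering of split scores that the two criteria would produce, but with very different costs as n grows.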
Source: https://stackoverflow.com/questions/57243267/why-is-training-a-random-forest-regressor-with-mae-criterion-so-slow-compared-to