Why is training a random forest regressor with MAE criterion so slow compared to MSE?

Posted by 谁说我不能喝 on 2020-07-18 10:00:51

Question


Even on small applications (<50K rows, <50 columns), training sklearn's RandomForestRegressor with the mean absolute error criterion is nearly 10x slower than training it with mean squared error. To illustrate on a small data set:

import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

def fit_rf_criteria(criterion, X=X, y=y):
    reg = RandomForestRegressor(n_estimators=100,
                                criterion=criterion,
                                n_jobs=-1,
                                random_state=1)
    start = time.time()
    reg.fit(X, y)
    end = time.time()
    print(end - start)

fit_rf_criteria('mse')  # 0.13266682624816895
fit_rf_criteria('mae')  # 1.26043701171875

Why does using the 'mae' criterion take so long for training a RandomForestRegressor? I want to optimize MAE for larger applications, but find the speed of the RandomForestRegressor tuned to this criterion prohibitively slow.


Answer 1:


Thank you @hellpanderr for sharing a reference to the project issue. To summarize: when the random forest regressor optimizes for MSE, it optimizes the L2-norm with a mean-based impurity metric. When it uses the MAE criterion, it optimizes the L1-norm, which amounts to calculating the median. Unfortunately, sklearn's implementation of the MAE criterion currently appears to take O(N^2).
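
To make the mean-versus-median difference concrete, here is a rough sketch in plain Python. It is not scikit-learn's actual code; the function names best_split_mse and best_split_mae_naive are hypothetical. It assumes the targets are already ordered by the feature being split on. The squared-error scan keeps running sums, so moving one sample across the split costs O(1). The naive absolute-error scan recomputes both medians at every candidate split, so the total work grows at least quadratically with the number of samples.

```python
import statistics

def best_split_mse(y):
    """Scan every split point using running sums: O(n) for the whole scan."""
    n = len(y)
    total, total_sq = sum(y), sum(v * v for v in y)
    left_sum = left_sq = 0.0
    best_cost, best_index = float("inf"), None
    for i in range(1, n):
        v = y[i - 1]
        left_sum += v          # O(1) update as one sample moves to the left side
        left_sq += v * v
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors around each side's mean:
        # sum(v^2) - (sum v)^2 / count
        cost = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if cost < best_cost:
            best_cost, best_index = cost, i
    return best_cost, best_index

def best_split_mae_naive(y):
    """Recompute both medians at every split point: at least O(n^2) overall."""
    n = len(y)
    best_cost, best_index = float("inf"), None
    for i in range(1, n):
        left, right = y[:i], y[i:]
        m_left = statistics.median(left)    # sorts the left side again
        m_right = statistics.median(right)  # sorts the right side again
        # Sum of absolute errors around each side's median
        cost = (sum(abs(v - m_left) for v in left)
                + sum(abs(v - m_right) for v in right))
        if cost < best_cost:
            best_cost, best_index = cost, i
    return best_cost, best_index

y_sorted_by_feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(best_split_mse(y_sorted_by_feature))        # (4.0, 3)
print(best_split_mae_naive(y_sorted_by_feature))  # (4.0, 3)
```

Both scans pick the same split on this toy input; the difference is only in how much work each candidate split costs.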



Source: https://stackoverflow.com/questions/57243267/why-is-training-a-random-forest-regressor-with-mae-criterion-so-slow-compared-to
