Problematic Random Forest training runtime when using formula interface

Submitted by 冷暖自知 on 2019-12-04 11:01:55

Question


When I run the Random Forest example from http://www.kaggle.com/c/icdar2013-gender-prediction-from-handwriting/data, the following line:

forest_model <- randomForest(as.factor(male) ~ ., data=train, ntree=10000)

takes hours (I am not sure whether it will ever finish, but the process does seem to be working).

The data set has 1128 rows and ~7000 variables.

Is it possible to estimate when the Random Forest training will finish? Can I profile R somehow to get more information?


Answer 1:


One way to monitor convergence is to turn on verbose output with do.trace:

iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        proximity=TRUE, do.trace=TRUE)
ntree      OOB      1      2      3
    1:   8.62%  0.00%  9.52% 15.00%
    2:   5.49%  0.00%  3.45% 13.79%
    3:   5.45%  0.00%  5.41% 11.76%
    4:   4.72%  0.00%  4.88%  9.30%
    5:   5.11%  0.00%  6.52%  8.89%
    6:   5.56%  2.08%  6.25%  8.33%
    7:   4.76%  0.00%  6.12%  8.16%
    8:   5.41%  0.00%  8.16%  8.16%
 .......
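
With a large forest such as the ntree=10000 in the question, one line per tree becomes noisy. do.trace also accepts an integer, in which case the running output is printed only every that many trees. A minimal sketch on iris, where ntree=500, the interval of 100 and the seed are arbitrary choices for illustration:

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the run is reproducible

# Print the OOB error every 100 trees instead of after every single tree.
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500, do.trace = 100)
```

The time between two printed lines gives a rough idea of how long each batch of trees takes.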



Answer 2:


Found the problem: using the formula interface in randomForest caused a tremendous performance degradation.

More on this, and on how to estimate random forest running time, can be found at https://stats.stackexchange.com/questions/37370/random-forest-computing-time-in-r and at http://www.gregorypark.org/?p=286
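
One rough way to estimate the total running time is to time a small pilot forest and scale the result up. This assumes that training time grows roughly linearly with ntree, which seems reasonable because the trees are grown independently. The sketch below is not taken from the linked pages: estimate_rf_seconds is a hypothetical helper, and the data is a small synthetic stand-in for the real training set.

```r
library(randomForest)

# Hypothetical helper: time a small pilot forest and scale linearly to the
# target number of trees (assumes cost is roughly proportional to ntree).
estimate_rf_seconds <- function(x, y, pilot_trees = 20, target_trees = 10000) {
  elapsed <- system.time(
    randomForest(x = x, y = y, ntree = pilot_trees)
  )[["elapsed"]]
  elapsed / pilot_trees * target_trees
}

# Synthetic stand-in data, much smaller than the 1128 x ~7000 set in the question.
set.seed(1)
x <- as.data.frame(matrix(rnorm(200 * 50), nrow = 200))
y <- factor(sample(c(0, 1), 200, replace = TRUE))

est <- estimate_rf_seconds(x, y, pilot_trees = 20, target_trees = 1000)
cat("Estimated seconds for 1000 trees:", round(est, 1), "\n")
```

On the real data, x and y would be the predictor columns and the response, and the estimate is only as good as the linearity assumption.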

Here is the final code:

forest_model <- randomForest(x=train[,-2], y=as.factor(train$male), ntree=10000, do.trace=TRUE)
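
To check the difference on your own machine, both interfaces can be timed on the same data with system.time. The following is only a sketch on synthetic data: the sizes, ntree=50 and the column name male are arbitrary choices for illustration, and the gap will depend on the number of variables.

```r
library(randomForest)

set.seed(1)
n <- 200; p <- 500  # arbitrary sizes for a quick comparison
train <- as.data.frame(matrix(rnorm(n * p), nrow = n))
train$male <- factor(sample(c(0, 1), n, replace = TRUE))

# Formula interface, as in the question.
t_formula <- system.time(
  randomForest(male ~ ., data = train, ntree = 50)
)[["elapsed"]]

# x/y interface; predictors are selected by name rather than by position.
predictors <- train[, names(train) != "male"]
t_xy <- system.time(
  randomForest(x = predictors, y = train$male, ntree = 50)
)[["elapsed"]]

cat("formula:", t_formula, "s; x/y:", t_xy, "s\n")
```

Selecting the predictors by column name avoids depending on the response being in a particular position, as train[,-2] does.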


Source: https://stackoverflow.com/questions/15321947/problematic-random-forest-training-runtime-when-using-formula-interface
