random_state parameter in classification models

北城以北 提交于 2021-02-16 14:25:07

问题


Can someone explain why does the random_state parameter affects the model so much?

I have a RandomForestClassifier model and want to set the random_state (for reproducibility pourpouses), but depending on the value I use I get very different values on my overall evaluation metric (F1 score)

For example, I tried to fit the same model with 100 different random_state values and after the training ad testing the smallest F1 was 0.64516129 and the largest 0.808823529). That is a huge difference.

This behaviour also seems to make very hard to compare two models.

Thoughts?


回答1:


If the random_state affects your results it means that your model has a high variance. In case of Random Forest this simply means that you use too small forest and should increase number of trees (which due to bagging - reduce variance). In scikit-learn this is controlled by n_estimators parameters in the constructor.

Why this happens? Each ML method tries to minimize the error, which from matematial perspective can be usually decomposed to bias and variance [+noise] (see bias variance dillema/tradeoff). Bias is simply how far from true values your model has to end up in the expectation - this part of an error usually comes from some prior assumptions, such as using linear model for nonlinear problem etc. Variance is how much your results differ when you train on different subsets of data (or use different hyperparameters, and in case of randomized methods random seed is a parameter). Hyperparameters are initialized by us and Parameters are learnt by the model itself in the training process. Finally - noise is not reducible error coming from the problem itself (or data representation). Thus, in your case - you simply encountered model with high variance, decision trees are well known for their extremely high variance (and small bias). Thus to reduce variance, Breiman proposed the specific bagging method, known today as Random Forest. The larger the forest - stronger the effect of variance reduction. In particular - forest with 1 tree has huge variance, forest of 1000 trees is nearly deterministic for moderate size problems.

To sum up, what you can do?

  • Increase number of trees - this has to work, and is well understood and justified method
  • treat random_seed as a hyperparameter during your evaluation, because this is exactly this - a meta knowledge you need to fix before hand if you do not wish to increase size of the forest.


来源:https://stackoverflow.com/questions/35944725/random-state-parameter-in-classification-models

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!