PySpark MLlib Random Forest classifier repeatability issue

Question

I am seeing behavior from the PySpark Random Forest classifier that I cannot explain. I want the model to be reproducible given the same training data. To that end, I set the seed parameter to a fixed integer value, as recommended on this page:

https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html

According to the documentation, the seed parameter is the random seed for bootstrapping and choosing feature subsets. I verified that repeated training runs with this seed produce absolutely identical models. But here's the question.

If I reorder the training data, or simply shuffle it, and run the training process again with the same seed value, it produces a different model. Can anyone help me understand this behavior? I thought the seed was used only for bootstrapping and choosing feature subsets. If that is the case, what is causing the difference?

It would be really good to understand this, and any help would be much appreciated. Thanks.
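To make the behavior described above concrete, here is a minimal pure-Python sketch, not Spark code. It assumes that a seeded bootstrap works by position: the seed fixes which positions get drawn, not which rows sit at those positions. My understanding is that Spark's sampling is similarly tied to how rows are laid out within partitions, but treat that as an assumption rather than a statement about its internals. The names `bootstrap_indices` and `seeded_bootstrap` are hypothetical helpers invented for this illustration.

```python
import random


def bootstrap_indices(n, seed):
    """Return n positions drawn with replacement; fixed for a given (n, seed)."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def seeded_bootstrap(rows, seed):
    """Pick rows by position using the seeded index sequence."""
    # The index sequence depends only on len(rows) and seed. Which rows
    # those indices select depends on the order of `rows`.
    return [rows[i] for i in bootstrap_indices(len(rows), seed)]


# Toy (feature, label) rows, already in sorted order.
rows = [("a", 0), ("b", 1), ("c", 0), ("d", 1), ("e", 0)]
reordered = list(reversed(rows))

same_order_1 = seeded_bootstrap(rows, seed=42)
same_order_2 = seeded_bootstrap(rows, seed=42)

# Same seed, same positions drawn, but applied to a different row order.
reordered_sample = seeded_bootstrap(reordered, seed=42)

# Imposing a canonical order first removes the dependence on input order.
canonical_sample = seeded_bootstrap(sorted(reordered), seed=42)

print(same_order_1 == same_order_2)      # identical input order
print(canonical_sample == same_order_1)  # sorted before sampling
print(reordered_sample)
print(same_order_1)
```

Under this assumption, one possible way to get a model that does not depend on input order is to put the training data into a deterministic order, and a fixed partitioning, before training. Whether that is sufficient in Spark would need to be checked against the version in use.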

Source: https://stackoverflow.com/questions/61718373/pyspark-mllib-random-forest-classifier-repeatability-issue

Tags

apache-spark

pyspark

random-forest

apache-spark-mllib
