Question
I can't figure out what is going on with the PySpark Random Forest classifier. I want the model to be reproducible given the same training data. To do so, I set the seed parameter to an integer value, as recommended on this page:
https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html
This seed parameter is the random seed for bootstrapping and choosing feature subsets. I verified that, with the seed fixed, repeated training runs on the same data produce identical models. But here's the question.
If I reorder the training data, or simply shuffle it, and run the training process again with the same seed value, it produces a different model. Can anyone help me understand this behavior? I thought the seed was used for bootstrapping and choosing feature subsets. If that's the case, what is causing this random behavior?
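To make the observation concrete, here is a minimal plain-Python sketch, not Spark's actual code. It assumes that a seeded bootstrap picks rows by their position in the input, so the seed fixes which positions are drawn rather than which records. The helper names bootstrap_positions and bootstrap_sample are hypothetical and exist only for this illustration.

```python
import random


def bootstrap_positions(n_rows, seed):
    """Return the row positions a seeded bootstrap would draw (with replacement)."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def bootstrap_sample(rows, seed):
    """Pick rows by position; the positions depend only on the seed and row count."""
    return [rows[i] for i in bootstrap_positions(len(rows), seed)]


rows = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]
reordered = rows[::-1]  # same records, different order

sample_original = bootstrap_sample(rows, seed=42)
sample_repeat = bootstrap_sample(rows, seed=42)
sample_reordered = bootstrap_sample(reordered, seed=42)

# Same seed and same row order: the sample is reproducible.
print(sample_original == sample_repeat)

# Same seed but reordered rows: the same positions now hold different records.
print(sample_original)
print(sample_reordered)
```

If Spark's sampling behaves like this assumption, the trees would see different bootstrap samples after a shuffle even though the seed is unchanged, which would match the behavior described above.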
It would be really good to understand this, and if anyone out there can help, it would be much appreciated. Thanks.
Source: https://stackoverflow.com/questions/61718373/pyspark-mllib-random-forest-classifier-repeatability-issue