Question
In Spark (2.1.0) I've used a CrossValidator to train a RandomForestRegressor, using a ParamGridBuilder over maxDepth and numTrees:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [2, 4, 6, 8, 10]) \
    .addGrid(rf.numTrees, [10, 20, 40, 50]) \
    .build()
After training, I can get the best number of trees:
regressor = cvModel.bestModel.stages[len(cvModel.bestModel.stages) - 1]
print(regressor.getNumTrees)
but I can't work out how to get the best maxDepth. I've read the documentation and I don't see what I'm missing.
I'd note that I can iterate through all the trees and find the depth of each one, e.g.
regressor.trees[0].depth
This seems like I'm missing something though.
Answer 1:
Unfortunately, before Spark 2.3 the PySpark RandomForestRegressionModel, unlike its Scala counterpart, doesn't store the upstream Estimator Params, but you should be able to retrieve the value directly from the underlying JVM object. With a simple monkey patch:
from pyspark.ml.regression import RandomForestRegressionModel

RandomForestRegressionModel.getMaxDepth = (
    lambda self: self._java_obj.getMaxDepth()
)
you can:
cvModel.bestModel.stages[-1].getMaxDepth()
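The patch relies on a general Python pattern: a function assigned to a class becomes a method on every instance, including instances created before the assignment. A minimal standalone sketch of the same trick, using an illustrative stand-in class (JavaBackedModel and _hidden are made-up names, not Spark API):

class JavaBackedModel:
    """Stand-in for a wrapper that hides a value the way
    RandomForestRegressionModel hides it in its JVM object."""
    def __init__(self, max_depth):
        self._hidden = {"maxDepth": max_depth}

model = JavaBackedModel(6)

# Attach a new method to the class after the fact; existing
# instances pick it up immediately.
JavaBackedModel.getMaxDepth = lambda self: self._hidden["maxDepth"]

print(model.getMaxDepth())  # → 6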
Answer 2:
Even simpler, just call
cvModel.bestModel.stages[-1]._java_obj.getMaxDepth()
As @user6910411 explained, you take the bestModel, reach its underlying JVM object, and extract the parameter with getMaxDepth(). The same approach works for other parameters.
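An alternative that avoids the JVM object entirely is to look up the winning entry in the grid itself: CrossValidatorModel exposes getEstimatorParamMaps() and avgMetrics in parallel, so the index of the best metric gives the best parameter combination (whether "best" means min or max depends on the evaluator; check evaluator.isLargerBetter()). A minimal pure-Python sketch of that lookup, using stand-in dicts and metric values rather than real Spark output:

# Stand-ins for cvModel.getEstimatorParamMaps() and cvModel.avgMetrics;
# the values below are illustrative, not real cross-validation results.
param_maps = [
    {"maxDepth": 2, "numTrees": 10},
    {"maxDepth": 4, "numTrees": 20},
    {"maxDepth": 6, "numTrees": 40},
]
avg_metrics = [1.9, 1.2, 1.5]  # e.g. RMSE, where lower is better

# Index of the best metric selects the best parameter map.
best_index = min(range(len(avg_metrics)), key=avg_metrics.__getitem__)
best_params = param_maps[best_index]
print(best_params["maxDepth"])  # prints 4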
Source: https://stackoverflow.com/questions/41690093/how-to-get-the-maxdepth-from-a-spark-randomforestregressionmodel