Pyspark - Get all parameters of models created with ParamGridBuilder

匿名 (未验证) 提交于 2019-12-03 02:23:02

问题:

I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest) depending on different parameters. ParamGridBuilder() allows to specify different values for a single parameters, and then perform (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

rdc = RandomForestClassifier() pipeline = Pipeline(stages=STAGES + [rdc]) paramGrid = ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20])                               .addGrid(rdc.minInfoGain, [0.01, 0.001])                               .addGrid(rdc.numTrees, [5, 10, 20, 30])                               .build() evaluator = MulticlassClassificationEvaluator() valid = TrainValidationSplit(estimator=pipeline,                              estimatorParamMaps=paramGrid,                              evaluator=evaluator,                              trainRatio=0.50) model = valid.fit(df) result = model.bestModel.transform(df) 

OK so now I'm able to retrieves simple information with a handmade function:

def evaluate(result):     predictionAndLabels = result.select("prediction", "label")     metrics = ["f1","weightedPrecision","weightedRecall","accuracy"]     for m in metrics:         evaluator = MulticlassClassificationEvaluator(metricName=m)         print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels))) 

Now I want several things:

  • What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
  • What are the parameters of all models?
  • What are the results (aka recall, accuracy, etc...) of each model ? I only found print(model.validationMetrics) that displays (it seems) a list containing the accuracy of each model, but I can't get to know which model to refers.

If I can retrieve all those informations, I should be able to display graphs, bar charts, and work as I do with Panda and sklearn.

回答1:

Long story short you simply cannot get parameters for all models because, similarly to CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection not exploration or experiments.

What are the parameters of all models?

While you cannot retrieve actual models validationMetrics correspond to input Params so you should be able to simply zip both:

from typing import Dict, Tuple, List, Any from pyspark.ml.param import Param from pyspark.ml.tuning import TrainValidationSplitModel  EvalParam = List[Tuple[float, Dict[Param, Any]]]  def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:     return list(zip(model.validationMetrics, model.getEstimatorParamMaps())) 

to get some about relationship between metrics and parameters.

If you need more information you should use Pipeline Params. It will preserve all model which can be used for further processing:

models = pipeline.fit(df, params=paramGrid) 

It will generate a list of the PipelineModels corresponding to the params argument:

zip(models, params) 


回答2:

I think I've found a way to do this. I wrote a function that specifically pulls out hyperparameters for a logistic regression that has two parameters, created with a CrossValidator:

def hyperparameter_getter(model_obj,cv_fold = 5.0):      enet_list = []     reg_list  = []      ## Get metrics      metrics = model_obj.avgMetrics     assert type(metrics) is list     assert len(metrics) > 0      ## Get the paramMap element      for x in range(len(model_obj._paramMap.keys())):     if model_obj._paramMap.keys()[x].name=='estimatorParamMaps':         param_map_key = model_obj._paramMap.keys()[x]      params = model_obj._paramMap[param_map_key]      for i in range(len(params)):     for k in params[i].keys():         if k.name =='elasticNetParam':         enet_list.append(params[i][k])         if k.name =='regParam':         reg_list.append(params[i][k])      results_df =  pd.DataFrame({'metrics':metrics,               'elasticNetParam': enet_list,               'regParam':reg_list})      # Because of [SPARK-16831][PYTHON]      # It only sums across folds, doesn't average     spark_version = [int(x) for x in sc.version.split('.')]      if spark_version[0] <= 2:     if spark_version[1] < 1:         results_df.metrics = 1.0*results_df['metrics'] / cv_fold      return results_df 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!