How to obtain the trained best model from a CrossValidator

zero323

Well, in cases like this one the answer is always the same - be specific about the types.

First extract the PipelineModel, since what you are trying to train is a Pipeline:

import org.apache.spark.ml.PipelineModel

// `model` is the CrossValidatorModel returned by fitting the CrossValidator
val bestModel: Option[PipelineModel] = model.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}
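
(For context, `model` above is assumed to be the CrossValidatorModel produced by fitting a CrossValidator whose estimator is a Pipeline ending in a DecisionTreeClassifier. A minimal sketch of such a setup, with placeholder names like `training` and `dt` and an assumed label/features schema, might look like this:)

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}

// Placeholder: `training` is assumed to be a DataFrame with "label" and "features" columns.
val dt = new DecisionTreeClassifier()
val pipeline = new Pipeline().setStages(Array[PipelineStage](dt))

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(
    new ParamGridBuilder().addGrid(dt.maxDepth, Array(3, 5, 7)).build())
  .setNumFolds(3)

// This is the `model` referenced in the answer above.
val model: CrossValidatorModel = cv.fit(training)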

Then you'll need to extract the model from the underlying stage. In your case it's a DecisionTreeClassificationModel:

import org.apache.spark.ml.classification.DecisionTreeClassificationModel

val treeModel: Option[DecisionTreeClassificationModel] = bestModel
  .flatMap {
    _.stages.collect {
      case t: DecisionTreeClassificationModel => t
    }.headOption
  }
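
Because collect uses a partial function, this picks out whichever stage is a DecisionTreeClassificationModel regardless of its position in the pipeline; if no such stage exists, the result is simply None rather than an exception.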

To print the tree, for example:

treeModel.foreach(m => println(m.toDebugString))
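
(Not part of the original answer, but if you also want to see how each parameter combination scored during cross-validation, the CrossValidatorModel keeps the average metric per parameter map:)

// avgMetrics(i) corresponds to getEstimatorParamMaps(i)
model.getEstimatorParamMaps.zip(model.avgMetrics).foreach {
  case (params, metric) => println(s"$params -> $metric")
}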

(DISCLAIMER: There is another aspect which, imho, deserves its own answer. I know it is a little off-topic given the question; however, it questions the question itself. If somebody downvotes because they disagree with the content, please also leave a comment.)

Should you extract the "best" tree at all? The answer is typically no.

Why are we doing CV? We are trying to evaluate our choices: which classifier to use, which hyperparameters, and which preprocessing steps (e.g. feature selection). For the last one it is important that the preprocessing happens on the training data only; e.g., do not normalise the features over all of the data. So the output of CV is the generated pipeline, not an individual fold's model. On a side note: the feature selection itself should be evaluated in an "internal" CV.
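
(To make the preprocessing point concrete, here is a hedged sketch, assuming a hypothetical raw feature column named "rawFeatures" and reusing the imports from the earlier sketch, of keeping normalisation inside the Pipeline so CrossValidator fits it on the training folds only:)

import org.apache.spark.ml.feature.StandardScaler

// Good: the scaler is a pipeline stage, so each CV fold fits it on that
// fold's training data only.
val scaler = new StandardScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
val pipelineWithScaling = new Pipeline()
  .setStages(Array[PipelineStage](scaler, new DecisionTreeClassifier()))

// Leaky: fitting the scaler once on all the data before cross-validation
// lets the validation folds influence the preprocessing.
// val leaky = scaler.fit(allData).transform(allData)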

What we are not doing is generating a "pool of classifiers" from which we choose the best one. However, I've seen this surprisingly often. The problem is that you have an extremely high chance of a twinning effect: even in a perfectly i.i.d. dataset there are likely (near-)duplicated training examples, so there is a pretty good chance that the "best" CV classifier is just an indication of which fold had the most twinning.

Hence, what should you do? Once you have fixed your parameters, you should use the entire training data to build the final model. Ideally (though hardly anybody does this) you will also have set aside an additional evaluation set, never touched during this process, to get an unbiased evaluation of your final model.
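
(A minimal sketch of that workflow, using a placeholder `data` DataFrame and reusing the `cv` defined above. Note that Spark's CrossValidator already refits the best parameter combination on the full dataset passed to fit, so bestModel is the final model:)

val Array(trainingData, holdout) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// Model selection and final fit: CV on the training portion only; the
// returned model is already refit on all of trainingData with the best params.
val cvModel = cv.fit(trainingData)

// One-time evaluation on the untouched holdout set.
val holdoutMetric = new MulticlassClassificationEvaluator()
  .evaluate(cvModel.transform(holdout))
println(s"Holdout metric: $holdoutMetric")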
