Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?

Posted by China☆狼群 on 2019-12-24 02:05:14

Question


By default, logistic regression training initializes the coefficients to be all-zero. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations -- I could simply restart training with the last known set of coefficients.

Is this possible with any of the dataset/dataframe-based APIs, preferably Scala?

Looking at the Spark source code, it seems that there is a method setInitialModel to initialize the model and its coefficients, but it's unfortunately marked as private.

The RDD-based API seems to allow initializing coefficients: one of the overloads of LogisticRegressionWithSGD.run(...) accepts an initialWeights vector. However, I would like to use the dataset-based API instead of the RDD-based API because (1) the former supports elastic net regularization (I couldn't figure out how to do elastic net with the RDD-based logistic regression) and (2) the RDD-based API is in maintenance mode.

I could always try using reflection to call that private setInitialModel method, but I would like to avoid this if possible (and maybe that wouldn't even work... I also can't tell if setInitialModel is marked private for a good reason).


Answer 1:


Feel free to override the method. Yes, you will need to copy that class into your own source tree, keeping the same package and class name. That's fine: do not fear.

When you build your project, whether with Maven or sbt, your local copy of the class will "win" and shadow the one in MLlib. Fortunately, the other classes in that same package will not be shadowed.

I have used this approach many times to override Spark classes, and your build times should stay short as well.
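A lighter-weight variant of the same idea can be sketched as follows. Instead of copying the whole class, it places a small helper in the same package as LogisticRegression. This relies on an assumption you should verify against your Spark version: that setInitialModel and the three-argument LogisticRegressionModel constructor are declared private[spark] (package-qualified) rather than strictly private, as they appear to be in the Spark 2.x sources. The object name LogisticRegressionWithInit and the method withInitialCoefficients are hypothetical and not part of Spark.

```scala
// Assumed file location in your own project:
//   src/main/scala/org/apache/spark/ml/classification/LogisticRegressionWithInit.scala
// The package must sit under org.apache.spark so private[spark] members are visible.
package org.apache.spark.ml.classification

import org.apache.spark.ml.linalg.Vector

// Hypothetical helper; not part of Spark.
object LogisticRegressionWithInit {

  /** Configures `lr` to start binary logistic regression training from the
    * given coefficients and intercept, and returns the same estimator.
    */
  def withInitialCoefficients(
      lr: LogisticRegression,
      coefficients: Vector,
      intercept: Double): LogisticRegression = {
    // Assumption: this constructor is private[spark] in your Spark version.
    val initialModel = new LogisticRegressionModel(lr.uid, coefficients, intercept)
    // Assumption: setInitialModel is private[spark] and returns the estimator.
    lr.setInitialModel(initialModel)
  }
}
```

From application code this might be used as LogisticRegressionWithInit.withInitialCoefficients(new LogisticRegression().setElasticNetParam(0.5), savedCoefficients, savedIntercept).fit(trainingData), where savedCoefficients, savedIntercept, and trainingData are placeholders for values from the interrupted run. Because this depends on Spark internals, it may break between Spark releases.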



Source: https://stackoverflow.com/questions/44892400/initializing-logistic-regression-coefficients-when-using-the-spark-dataset-based
