Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?

Question

By default, logistic regression training initializes the coefficients to be all-zero. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations -- I could simply restart training with the last known set of coefficients.

Is this possible with any of the dataset/dataframe-based APIs, preferably Scala?

Looking at the Spark source code, it seems that there is a method setInitialModel to initialize the model and its coefficients, but it's unfortunately marked as private.

The RDD-based API seems to allow initializing coefficients: one of the overloads of LogisticRegressionWithSGD.run(...) accepts an initialWeights vector. However, I would like to use the dataset-based API instead of the RDD-based API because (1) the former supports elastic net regularization (I couldn't figure out how to do elastic net with the RDD-based logistic regression) and (2) the RDD-based API is in maintenance mode.
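For reference, the RDD-based warm start mentioned above might look roughly like the sketch below. It uses LogisticRegressionWithLBFGS rather than LogisticRegressionWithSGD: both extend GeneralizedLinearAlgorithm, which is where the run(input, initialWeights) overload comes from, and the SGD class's constructor is deprecated in Spark 2.x. The object name RddWarmStart, the toy data, and the starting weights are made up for illustration.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object RddWarmStart {

  // Starts the optimizer from caller-supplied weights instead of the default.
  // With the default addIntercept = false, initialWeights needs one entry per feature.
  def train(data: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel =
    new LogisticRegressionWithLBFGS().run(data, initialWeights)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-warm-start")
      .getOrCreate()

    // Toy, deliberately non-separable data with two features.
    val data = spark.sparkContext.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(0.0, Vectors.dense(0.2, 0.9)),
      LabeledPoint(1.0, Vectors.dense(0.8, 0.1)),
      LabeledPoint(1.0, Vectors.dense(0.1, 0.8)),
      LabeledPoint(0.0, Vectors.dense(0.9, 0.2))
    )).cache()

    // Stand-in for the coefficients saved from an interrupted run.
    val lastKnownWeights = Vectors.dense(0.5, -0.5)

    val model = train(data, lastKnownWeights)
    println(s"weights after resuming: ${model.weights}")

    spark.stop()
  }
}
```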

I could always try using reflection to call that private setInitialModel method, but I would like to avoid this if possible (and maybe that wouldn't even work... I also can't tell if setInitialModel is marked private for a good reason).

Answer 1:

Feel free to override the method. Yes, that means copying the class into your own source tree, under the same package name, and changing the method's visibility there. That's fine: do not fear.

When you build your project, whether with Maven or sbt, your local copy of the class will "win" and shadow the one in MLlib. Fortunately, the other classes in that same package will not be shadowed.

I have used this approach many times to override Spark classes, and build times should stay short as well.
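A related sketch, which is not the copy-the-class approach itself: in the Spark 2.x source, setInitialModel and the (uid, coefficients, intercept) constructor of LogisticRegressionModel appear to be declared private[spark] rather than fully private. If that holds for your Spark version, a small helper compiled into a package under org.apache.spark can call them without copying the whole class. Treat that visibility as an assumption to verify; the names LogisticRegressionWarmStart and withInitialCoefficients are hypothetical.

```scala
// Placed in your own project, but declared inside Spark's package so that
// private[spark] members are visible. This relies on Spark internals and
// may break between versions.
package org.apache.spark.ml.classification

import org.apache.spark.ml.linalg.Vector

// Hypothetical helper, not part of Spark.
object LogisticRegressionWarmStart {

  // Wraps the given coefficients in a model and hands it to the estimator
  // as its starting point. Assumes binary logistic regression.
  def withInitialCoefficients(
      lr: LogisticRegression,
      coefficients: Vector,
      intercept: Double): LogisticRegression = {
    // Assumption: this three-argument constructor is private[spark].
    val initialModel = new LogisticRegressionModel(lr.uid, coefficients, intercept)
    // Assumption: setInitialModel is private[spark] and returns the estimator.
    lr.setInitialModel(initialModel)
  }
}

// Possible use from application code (lastCoefficients, lastIntercept and
// trainingDf are placeholders):
//
//   val lr = new LogisticRegression().setRegParam(0.01).setElasticNetParam(0.5)
//   val resumed = LogisticRegressionWarmStart
//     .withInitialCoefficients(lr, lastCoefficients, lastIntercept)
//   val model = resumed.fit(trainingDf)
```

The coefficient vector should have one entry per feature in the training data, so it is worth checking that against the dataset before calling fit.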

Source: https://stackoverflow.com/questions/44892400/initializing-logistic-regression-coefficients-when-using-the-spark-dataset-based

Tags

apache-spark

apache-spark-mllib

apache-spark-ml