Question
I have code which essentially looks like this:
class FoodTrainer(images: S3Path) { // data is >100GB file living in S3
def train(): FoodClassifier // Very expensive - takes ~5 hours!
}
class FoodClassifier { // Light-weight API class
def isHotDog(input: Image): Boolean
}
I want to invoke val classifier = new FoodTrainer(s3Dir).train() at JAR-assembly (sbt assembly) time, and publish a JAR in which the classifier instance is instantly available to downstream library users.
What is the easiest way to do this? What are some established paradigms for it? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
How do I do this using sbt assembly without having to check a large model class or data file into my version control?
Answer 1:
You should serialize the data which results from training into its own file, then package that data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
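The round trip described above can be sketched with plain Java serialization. This is a hedged sketch: the FoodClassifier field and the ModelIO helper are made up for illustration, and any stable serialization format would do equally well.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for the real lightweight API class; assumed to be Serializable.
// The `threshold` field is hypothetical.
case class FoodClassifier(threshold: Double) extends Serializable

object ModelIO {
  // Serialize the trained model to a file (e.g. under src/main/resources,
  // so that `sbt assembly` packages it into the JAR).
  def save(model: FoodClassifier, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(model) finally out.close()
  }

  // Production code reads the file back instead of re-running training;
  // inside a JAR the stream would come from getClass.getResourceAsStream.
  def load(path: String): FoodClassifier = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[FoodClassifier] finally in.close()
  }
}
```

The try/finally blocks make sure the streams are closed even if (de)serialization throws.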
Answer 2:
The steps are as follows.
- Generate the model during the resource-generation phase of the build.
- Serialize the contents of the model to a file in the managed resources folder:

resourceGenerators in Compile += Def.task {
  val classifier = new FoodTrainer(s3Dir).train()
  val contents = FoodClassifier.serialize(classifier)
  val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
  IO.write(file, contents)
  Seq(file)
}.taskValue
- The resource will be included in the jar file automatically and won't appear in the source tree.
- To load the model, just add code that reads the resource and parses the model:

object FoodClassifierModel {
  lazy val classifier = readResource("/mypackage/food-classifier.model")

  def readResource(resourceName: String): FoodClassifier = {
    val stream = getClass.getResourceAsStream(resourceName)
    try {
      val contents = scala.io.Source.fromInputStream(stream).getLines().mkString("\n")
      FoodClassifier.parse(contents)
    } finally stream.close()
  }
}

object FoodClassifier {
  def parse(content: String): FoodClassifier = ???        // implementation elided
  def serialize(classifier: FoodClassifier): String = ??? // implementation elided
}
Of course, as your data is rather big, you'll need to use streaming serializers and parsers so as not to overload the Java heap. The above just shows how to package a resource at build time.
See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html
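The streaming point above can be sketched as follows. This is a hedged illustration, assuming the model boils down to an array of weights: values are written and read one at a time, so the serialized form never has to exist in memory as a single String.

```scala
import java.io.{DataInputStream, DataOutputStream}

// Hypothetical streaming (de)serializer for a weights-based model.
object StreamingModelIO {
  def write(weights: Array[Double], out: DataOutputStream): Unit = {
    out.writeInt(weights.length)     // header: number of weights
    weights.foreach(out.writeDouble) // stream each value individually
    out.flush()
  }

  def read(in: DataInputStream): Array[Double] = {
    val n = in.readInt()
    Array.fill(n)(in.readDouble())   // read back value by value
  }
}
```

Because both sides operate on streams, the same code works whether the destination is a file during the build or a classpath resource at runtime.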
Answer 3:
Okay, I managed to do this:

Separate the food-trainer module into two SBT sub-modules: food-trainer and food-model. The former is only invoked at compile time to create the model and serialize it into the generated resources of the latter. The latter serves as a simple factory object to instantiate a model from the serialized version. Every downstream project depends only on this food-model sub-module.

The food-trainer has the bulk of the code and a main method that can serialize the FoodModel:

object FoodTrainer {
  def main(args: Array[String]): Unit = {
    val input = args(0)
    val outputDir = args(1)
    val model: FoodModel = new FoodTrainer(input).train()
    val out = new ObjectOutputStream(new FileOutputStream(outputDir + "/model.bin"))
    try out.writeObject(model) finally out.close()
  }
}
Add a compile-time task in your build.sbt to generate the model from the food-trainer module:

lazy val foodTrainer = (project in file("food-trainer"))

lazy val foodModel = (project in file("food-model"))
  .dependsOn(foodTrainer)
  .settings(
    resourceGenerators in Compile += Def.task {
      val log = streams.value.log
      val dest = (resourceManaged in Compile).value
      IO.createDirectory(dest)
      runModuleMain(
        cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
        cp = (fullClasspath in Runtime in foodTrainer).value.files,
        log = log
      )
      Seq(dest / "model.bin")
    }.taskValue
  )

def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
  log.info(s"Running $cmd")
  val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
  val res = Fork.scala(config = opt, arguments = cmd.split(' '))
  require(res == 0, s"$cmd exited with code $res")
}
Now in your food-model module, you have something like this:

object FoodModel {
  lazy val model: FoodModel =
    new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
      .readObject()
      .asInstanceOf[FoodModel]
}
Every downstream project now depends only on food-model and simply uses FoodModel.model. We get the benefits of:
- The model loads quickly at runtime from the JAR's packaged resources.
- No need to train the model at runtime (very expensive).
- No need to check the model into version control (again, the binary model is very big) - it is only packaged into your JAR.
- No need to separate the FoodTrainer and FoodModel packages into their own JARs (which would mean the headache of deploying them internally) - instead we simply keep them in the same project as different sub-modules, which get packed into a single JAR.
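Downstream consumption then reduces to a one-line dependency. A hedged sketch of a consumer's build.sbt (the organization, artifact name, and version are made up for illustration):

```scala
// Hypothetical downstream project's build.sbt: depend only on the published
// food-model artifact; the trainer and its S3 access never run here.
libraryDependencies += "com.foo.bar" %% "food-model" % "1.0.0"
```

After that, downstream code just references FoodModel.model; thanks to the lazy val, the deserialization cost is paid once, on first access.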
Answer 4:
Here's an idea: put your model in a resources folder that gets added into the jar assembly. I think all jars get distributed with your model if it's in that folder. Lmk how it goes, cheers!
Check this out for reading from resource:
https://www.mkyong.com/java/java-read-a-file-from-resources-folder/
It's in Java, but you can still use the API from Scala.
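For reference, the linked Java recipe translates to a few lines of Scala. A minimal sketch (the helper name is ours; it takes an InputStream, so it works equally with getClass.getResourceAsStream on a packaged resource):

```scala
import java.io.InputStream

object ResourceReader {
  // Read an entire stream as UTF-8 text, closing it afterwards.
  // In a packaged JAR the stream would typically come from
  // getClass.getResourceAsStream("/some-resource.txt").
  def readText(stream: InputStream): String =
    try scala.io.Source.fromInputStream(stream, "UTF-8").mkString
    finally stream.close()
}
```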
Source: https://stackoverflow.com/questions/47184682/sbt-how-to-package-an-instance-of-a-class-as-a-jar