Question
I have code which essentially looks like this:
class FoodTrainer(images: S3Path) { // data is >100GB file living in S3
def train(): FoodClassifier // Very expensive - takes ~5 hours!
}
class FoodClassifier { // Light-weight API class
def isHotDog(input: Image): Boolean
}
I want to invoke val classifier = new FoodTrainer(s3Dir).train() at JAR-assembly (sbt assembly) time, and publish a JAR in which the classifier instance is instantly available to downstream library users.
What is the easiest way to do this? What are some established paradigms for it? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
How do I do this using sbt assembly without having to check a large model class or data file into my version control?
Answer 1:
You should serialize the data which results from training into its own file, then package that data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
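The round trip described above can be sketched with plain Java serialization. This is a hedged sketch: the FoodClassifier field and the ModelIO helper are made up for illustration, and any stable serialization format would do equally well.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for the real lightweight API class; assumed to be Serializable.
// The `threshold` field is hypothetical.
case class FoodClassifier(threshold: Double) extends Serializable

object ModelIO {
  // Serialize the trained model to a file (e.g. under src/main/resources,
  // so that `sbt assembly` packages it into the JAR).
  def save(model: FoodClassifier, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(model) finally out.close()
  }

  // Production code reads the file back instead of re-running training;
  // inside a JAR the stream would come from getClass.getResourceAsStream.
  def load(path: String): FoodClassifier = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[FoodClassifier] finally in.close()
  }
}
```

The try/finally blocks make sure the streams are closed even if (de)serialization throws.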
Answer 2:
The steps are as follows.
- Generate the model during the resource-generation phase of the build.
- Serialize the contents of the model to a file in the managed resources folder:

resourceGenerators in Compile += Def.task {
  val classifier = new FoodTrainer(s3Dir).train()
  val contents = FoodClassifier.serialize(classifier)
  val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
  IO.write(file, contents)
  Seq(file)
}.taskValue
- The resource will be included in the jar file automatically and won't appear in the source tree.
- To load the model, just add code that reads the resource and parses the model:

object FoodClassifierModel {
  lazy val classifier = readResource("/mypackage/food-classifier.model")

  def readResource(resourceName: String): FoodClassifier = {
    val stream = getClass.getResourceAsStream(resourceName)
    try {
      val contents = scala.io.Source.fromInputStream(stream).getLines().mkString("\n")
      FoodClassifier.parse(contents)
    } finally stream.close()
  }
}

object FoodClassifier {
  def parse(content: String): FoodClassifier = ???        // implementation elided
  def serialize(classifier: FoodClassifier): String = ??? // implementation elided
}
Of course, as your data is rather big, you'll need to use streaming serializers and parsers so as not to overload the Java heap. The above just shows how to package a resource at build time.
See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html
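The streaming point above can be sketched as follows. This is a hedged illustration, assuming the model boils down to an array of weights: values are written and read one at a time, so the serialized form never has to exist in memory as a single String.

```scala
import java.io.{DataInputStream, DataOutputStream}

// Hypothetical streaming (de)serializer for a weights-based model.
object StreamingModelIO {
  def write(weights: Array[Double], out: DataOutputStream): Unit = {
    out.writeInt(weights.length)     // header: number of weights
    weights.foreach(out.writeDouble) // stream each value individually
    out.flush()
  }

  def read(in: DataInputStream): Array[Double] = {
    val n = in.readInt()
    Array.fill(n)(in.readDouble())   // read back value by value
  }
}
```

Because both sides operate on streams, the same code works whether the destination is a file during the build or a classpath resource at runtime.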
Answer 3:
Okay, I managed to do this:

Separate the food-trainer module into two SBT sub-modules: food-trainer and food-model. The former is only invoked at compile time to create the model and serialize it into the generated resources of the latter. The latter serves as a simple factory object to instantiate a model from the serialized version. Every downstream project depends only on this food-model sub-module.

The food-trainer has the bulk of the code and a main method that can serialize the FoodModel:

object FoodTrainer {
  def main(args: Array[String]): Unit = {
    val input = args(0)
    val outputDir = args(1)
    val model: FoodModel = new FoodTrainer(input).train()
    val out = new ObjectOutputStream(new FileOutputStream(outputDir + "/model.bin"))
    try out.writeObject(model) finally out.close()
  }
}
Add a compile-time task in your build.sbt to generate the model from the food-trainer module:

lazy val foodTrainer = (project in file("food-trainer"))

lazy val foodModel = (project in file("food-model"))
  .dependsOn(foodTrainer)
  .settings(
    resourceGenerators in Compile += Def.task {
      val log = streams.value.log
      val dest = (resourceManaged in Compile).value
      IO.createDirectory(dest)
      runModuleMain(
        cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
        cp = (fullClasspath in Runtime in foodTrainer).value.files,
        log = log
      )
      Seq(dest / "model.bin")
    }.taskValue
  )

def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
  log.info(s"Running $cmd")
  val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
  val res = Fork.scala(config = opt, arguments = cmd.split(' '))
  require(res == 0, s"$cmd exited with code $res")
}
Now in your food-model module, you have something like this:

object FoodModel {
  lazy val model: FoodModel =
    new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
      .readObject()
      .asInstanceOf[FoodModel]
}
Every downstream project now depends only on food-model and simply uses FoodModel.model. We get the benefits of:
- The model loads quickly at runtime from the JAR's packaged resources.
- No need to train the model at runtime (very expensive).
- No need to check the model into version control (again, the binary model is very big) - it is only packaged into your JAR.
- No need to separate the FoodTrainer and FoodModel packages into their own JARs (which would mean the headache of deploying them internally) - instead we simply keep them in the same project as different sub-modules, which get packed into a single JAR.
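Downstream consumption then reduces to a one-line dependency. A hedged sketch of a consumer's build.sbt (the organization, artifact name, and version are made up for illustration):

```scala
// Hypothetical downstream project's build.sbt: depend only on the published
// food-model artifact; the trainer and its S3 access never run here.
libraryDependencies += "com.foo.bar" %% "food-model" % "1.0.0"
```

After that, downstream code just references FoodModel.model; thanks to the lazy val, the deserialization cost is paid once, on first access.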
Answer 4:
Here's an idea: put your model in a resources folder that gets added into the jar assembly. I think all jars get distributed with your model if it's in that folder. Lmk how it goes, cheers!
Check this out for reading from resource:
https://www.mkyong.com/java/java-read-a-file-from-resources-folder/
It's in Java, but you can still use the API from Scala.
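For reference, the linked Java recipe translates to a few lines of Scala. A minimal sketch (the helper name is ours; it takes an InputStream, so it works equally with getClass.getResourceAsStream on a packaged resource):

```scala
import java.io.InputStream

object ResourceReader {
  // Read an entire stream as UTF-8 text, closing it afterwards.
  // In a packaged JAR the stream would typically come from
  // getClass.getResourceAsStream("/some-resource.txt").
  def readText(stream: InputStream): String =
    try scala.io.Source.fromInputStream(stream, "UTF-8").mkString
    finally stream.close()
}
```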
Source: https://stackoverflow.com/questions/47184682/sbt-how-to-package-an-instance-of-a-class-as-a-jar