Question
This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map from string keys to org.apache.spark.mllib.stat.KernelDensity models, and streamingData is a streaming dataframe: org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enhance each row with the prediction, and write to Kafka. An obvious way would be .withColumn, using a UDF to predict, and writing with the Kafka sink, roughly as sketched below.
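Roughly what that attempt looked like (just a sketch: the column names and the id1/id2/id3 key layout mirror the custom sink further down, and the Kafka options are placeholders):

import org.apache.spark.sql.functions.{col, concat_ws, udf}

// Sketch of the withColumn + UDF attempt. The UDF closes over modelsMap, so the
// KernelDensity models (and the RDDs they hold) get shipped to the executors.
val predictUdf = udf { (key: String, time0: Double) =>
  modelsMap.get(key)
    .map(_.estimate(Array(time0))(0))   // KernelDensity.estimate is an RDD action
    .getOrElse(Double.NaN)
}

streamingData
  .withColumn("prediction",
    predictUdf(concat_ws("/", col("id1"), col("id2"), col("id3")), col("time_0")))
  .selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<broker>")
  .option("topic", "<topic>")
  .option("checkpointLocation", "<checkpoint-dir>")
  .start()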
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I get the same error with a custom sink that implements ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.{ForeachWriter, Row}
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._

class customSink(topic: String, servers: String) extends ForeachWriter[Row] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")

  var producer: KafkaProducer[String, String] = _

  // One producer per partition/epoch, created on the executor.
  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: Row): Unit = {
    var prediction = Double.NaN
    try {
      val id1 = value(0)
      val id2 = value(3)
      val id3 = value(5)
      val time_0 = value(6).asInstanceOf[Double]
      val key = f"$id1/$id2/$id3"
      println("Looking up key: " + key)
      val model = modelsMap(key)
      // This is the call that fails: estimate() is an RDD action under the hood.
      prediction = model.estimate(Array[Double](time_0))(0)
      println(prediction)
    } catch {
      case e: NoSuchElementException =>
        // No model for this key; prediction stays NaN.
        println(prediction)
    }
    producer.send(new ProducerRecord(topic, value.mkString(",") + "," + prediction.toString))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val writer = new customSink("<topic>", "<broker>")
val query = streamingData
  .writeStream
  .foreach(writer)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around that; the sketch below shows why.
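For context, this is roughly how each model in modelsMap is built (a sketch with placeholder sample values and bandwidth; sc is the driver's SparkContext): the model keeps a reference to an RDD, and estimate aggregates over that RDD, so it only works on the driver.

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// Placeholder sample and bandwidth; the real models are keyed in modelsMap.
val sample: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.5, 4.2))

val model = new KernelDensity()
  .setSample(sample)    // the model holds on to this RDD
  .setBandwidth(3.0)

// estimate() runs an aggregate over `sample`, i.e. it is an RDD action, which is
// why calling it from inside a UDF or ForeachWriter task fails with SPARK-5063.
val densities: Array[Double] = model.estimate(Array(-1.0, 2.0, 5.0))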
What changes do I make? (I could collect each batch and execute a for loop on the driver, as sketched below, but then I'm not using Spark the way it's intended.)
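For reference, the collect-per-batch fallback I mean would look roughly like this (a sketch only: it assumes a Spark version that has foreachBatch, and driverProducer is a hypothetical KafkaProducer[String, String] created on the driver):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// Driver-side fallback: collect each micro-batch and score it in a plain loop.
// model.estimate is safe here because it runs on the driver, but every batch is
// funnelled through a single machine, so this doesn't scale.
val query = streamingData
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.collect().foreach { row =>
      val key = s"${row(0)}/${row(3)}/${row(5)}"        // same key layout as the sink
      val prediction = modelsMap.get(key)
        .map(_.estimate(Array(row.getDouble(6)))(0))    // RDD action, driver side
        .getOrElse(Double.NaN)
      driverProducer.send(
        new ProducerRecord("<topic>", row.mkString(",") + "," + prediction))
    }
  }
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()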
References:
[1] https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose (slide 11 mentions the limitation)
[2] https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
[3] https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
[4] https://issues.apache.org/jira/browse/SPARK-16454
[5] https://issues.apache.org/jira/browse/SPARK-16407
Source: https://stackoverflow.com/questions/50163211/cannot-evaluate-ml-model-on-structured-streaming-because-rdd-transformations-an