Opening two KafkaStreams after each other with different StreamingContext


Question


I am currently trying to implement a two-stage process in Spark Streaming. First I open a KafkaStream, read everything that is already in the topic by using auto.offset.reset=earliest, and train my model on it. I use a stream for that, as I could not find out how to do it without opening a stream first (see Spark - Get earliest and latest offset of Kafka without opening stream). Since I have not found a way to stop the streams without stopping the whole StreamingContext, I stop the context after the model calculation with ssc.stop(true, true).
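
A minimal sketch of how such a training stream can be created with the Kafka 0.10 direct API (the broker address, topic name, and group id below are placeholders; the question does not show this part):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder connection settings; the real values are not part of the question
val trainingKafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "training-group",
  "auto.offset.reset"  -> "earliest",              // read the topic from the beginning
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val trainingStream = KafkaUtils.createDirectStream[String, String](
  trainingSsc,
  PreferConsistent,
  Subscribe[String, String](Seq("spans"), trainingKafkaParams)
)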

When I then create a new StreamingContext (using either the old sparkConfig or a new one with the same parameters) and call my method to open a new KafkaStream with a new groupId and auto.offset.reset=latest, it looks like no streaming is happening at all when I write new content to the Kafka topic. Neither print() nor count() nor a println inside forEachRDD produces any output in my IDE.

The structure of the application looks like:

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName(sparkAppName).setMaster(sparkMaster)
      .set("spark.local.dir", sparkLocalDir)
      .set("spark.driver.allowMultipleContexts", "true")

    sparkConf.registerKryoClasses(Array(classOf[Span]))
    sparkConf.registerKryoClasses(Array(classOf[Spans]))
    sparkConf.registerKryoClasses(Array(classOf[java.util.Map[String, String]]))

    val trainingSsc = new StreamingContext(sparkConf, Seconds(batchInterval))
    trainingSsc.checkpoint(checkpoint)
    //val predictor = (model, ninetynine, median, avg, max)
    val result = trainKMeans(trainingSsc);

    trainingSsc.stop(true, false)

    val predictionSsc = new StreamingContext(sparkConf, Seconds(batchInterval))
    val threshold = result._5
    val model = result._1

    kMeansAnomalyDetection(predictionSsc, model, threshold)  
  }

I hope you can point me to the mistake I made - and if you need further details just let me know. Any help and hints are much appreciated.


Answer 1:


In general, the program looks like it's going in the right direction, but there are a few points that need fixing:

Spark Streaming only starts the streaming scheduler when streamingContext.start() is issued, and DStream operations are executed only by that scheduler. This means that sequencing these two calls will not bear any results:

val result = trainKMeans(trainingSsc);
trainingSsc.stop(true, false)

The streaming context will be stopped before any training can take place.

Instead, we should do this:

val result = trainKMeans(trainingSsc)
// The stop must be issued from a DStream output operation; `trainingStream` is a placeholder
// for the Kafka DStream used inside trainKMeans. Note that we don't stop the SparkContext here.
trainingStream.foreachRDD { _ => trainingSsc.stop(false, false) }
trainingSsc.start()
trainingSsc.awaitTermination()

In this case, we start the streaming process; we let the first interval execute, in which our model will be trained, and then we stop the processing.

The second stream should be started with a different consumer group than the first one (the Kafka stream creation is not shown in the code snippet; see the sketch below).
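
For illustration, the second stream could be created like this (same imports as the sketch in the question; broker address and topic name are again placeholders, and only the differing group.id and auto.offset.reset=latest are essential):

val predictionKafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "prediction-group",      // must differ from the training group
  "auto.offset.reset"  -> "latest",                // only consume records written after startup
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val predictionStream = KafkaUtils.createDirectStream[String, String](
  predictionSsc,
  PreferConsistent,
  Subscribe[String, String](Seq("spans"), predictionKafkaParams)
)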

For the second streaming context, we are missing a start:

val predictionSsc = new StreamingContext(sparkContext, Seconds(batchInterval)) // note that we pass a SparkContext here, not a config: we reuse the same Spark context, e.g. obtained as trainingSsc.sparkContext
val threshold = result._5
val model = result._1
kMeansAnomalyDetection(predictionSsc, model, threshold) 
predictionSsc.start()
predictionSsc.awaitTermination()

We should have a working stream at this point.
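
Putting the pieces together, a rough sketch of the revised main could look as follows. It assumes that trainKMeans builds its Kafka DStream from the passed context and makes it available to main (referred to here by the placeholder name trainingStream), so the stop hook can be registered on it, and that kMeansAnomalyDetection creates its own stream from the prediction context:

def main(args: Array[String]): Unit = {
  // allowMultipleContexts is no longer needed, since a single SparkContext is reused
  val sparkConf = new SparkConf().setAppName(sparkAppName).setMaster(sparkMaster)
    .set("spark.local.dir", sparkLocalDir)
  sparkConf.registerKryoClasses(
    Array(classOf[Span], classOf[Spans], classOf[java.util.Map[String, String]]))

  // Phase 1: train on everything already in the topic
  val trainingSsc = new StreamingContext(sparkConf, Seconds(batchInterval))
  trainingSsc.checkpoint(checkpoint)
  val result = trainKMeans(trainingSsc)
  // Stop the streaming (but not the SparkContext) once the first batch has been processed;
  // `trainingStream` is a placeholder for the DStream that trainKMeans is assumed to expose
  trainingStream.foreachRDD { _ => trainingSsc.stop(false, false) }
  trainingSsc.start()
  trainingSsc.awaitTermination()

  // Phase 2: detect anomalies on new records, reusing the same SparkContext
  val predictionSsc = new StreamingContext(trainingSsc.sparkContext, Seconds(batchInterval))
  val model = result._1
  val threshold = result._5
  kMeansAnomalyDetection(predictionSsc, model, threshold)
  predictionSsc.start()
  predictionSsc.awaitTermination()
}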



Source: https://stackoverflow.com/questions/45117513/opening-two-kafkastreams-after-each-other-with-different-streamingcontext
