Spark Structured Streaming: multiple queries are not running concurrently

Submitted by 烂漫一生 on 2019-12-08 00:53:53

Question


I slightly modified the example taken from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala

I added a second writeStream (sink):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
// Prints each updated row produced by the first query.
case class MyWriter1() extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }

  override def close(errorOrNull: Throwable): Unit = ()
}

// Prints each (word, count) pair produced by the second query.
// Note: groupBy().count() yields a LongType column, so the tuple uses Long.
case class MyWriter2() extends ForeachWriter[(String, Long)] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: (String, Long)): Unit = {
    println(s"custom2 - $value")
  }

  override def close(errorOrNull: Throwable): Unit = ()
}


object Main extends Serializable{

  def main(args: Array[String]): Unit = {
    println("starting")

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("app-test")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query1 = wordCounts.writeStream
      .outputMode("update")
      .foreach(MyWriter1())
      .start()

    val ds = wordCounts.map(x => (x.getAs[String]("value"), x.getAs[Long]("count")))

    val query2 = ds.writeStream
      .outputMode("update")
      .foreach(MyWriter2())
      .start()

    spark.streams.awaitAnyTermination()

  }
}

Unfortunately, only the first query runs; the second never runs (MyWriter2 is never called).

Please advise on what I'm doing wrong. According to the docs: "You can start any number of queries in a single SparkSession. They will all be running concurrently sharing the cluster resources."


Answer 1:


Are you using nc -lk 9999 to send data to Spark? Every query creates its own connection to nc, but nc only sends data to the first connection (query). You can write a TCP server instead of nc.
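
For illustration, here is a minimal sketch of such a broadcasting TCP server in Scala. The object name BroadcastServer is made up for this example, and the port 9999 matches the question's code:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable.ListBuffer
import scala.io.StdIn

// Sketch: a tiny TCP "broadcast" server. Unlike nc, it fans each input
// line out to every connected client, so both streaming queries receive
// the data.
object BroadcastServer {
  def main(args: Array[String]): Unit = {
    val server  = new ServerSocket(9999)
    val clients = ListBuffer.empty[PrintWriter]

    // Accept connections in the background; each query opens one socket.
    val acceptor = new Thread(new Runnable {
      def run(): Unit = while (true) {
        val socket = server.accept()
        clients.synchronized {
          clients += new PrintWriter(socket.getOutputStream, true) // autoflush
        }
      }
    })
    acceptor.setDaemon(true)
    acceptor.start()

    // Read lines from stdin and send each one to all connected clients.
    Iterator.continually(StdIn.readLine()).takeWhile(_ != null).foreach { line =>
      clients.synchronized(clients.foreach(_.println(line)))
    }
  }
}

Run it in place of nc, type lines into its stdin, and both queries should then receive the same input.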




Answer 2:


I had the same situation (but on the newer Structured Streaming API), and in my case it helped to call awaitTermination() on the last streaming query.

Something like:

query1.start()
query2.start().awaitTermination()

Update: Instead of the above, this built-in method is better:

sparkSession.streams.awaitAnyTermination()



Answer 3:


You are using .awaitAnyTermination(), which will terminate the application when the first stream returns; you have to wait for both streams to finish before terminating.

Something like this should do the trick:

 query1.awaitTermination()
 query2.awaitTermination()


Source: https://stackoverflow.com/questions/45331883/spark-structured-streaming-multiples-queries-are-not-running-concurrently
