Nothing is being printed out from a Flink Patterned Stream

Question

I have this code below:

import java.util.Properties

import com.google.gson._
import com.typesafe.config.ConfigFactory
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.CEP
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object WindowedWordCount {
  val configFactory = ConfigFactory.load()
  def main(args: Array[String]) = {
    val brokers = configFactory.getString("kafka.broker")
    val topicChannel1 = configFactory.getString("kafka.topic1")

    val props = new Properties()
    ...

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val dataStream = env.addSource(new FlinkKafkaConsumer010[String](topicChannel1, new SimpleStringSchema(), props))

    val partitionedInput = dataStream.keyBy(jsonString => {
      val jsonParser = new JsonParser()
      val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
      jsonObject.get("account")
    })

    val priceCheck = Pattern.begin[String]("start").where({jsonString =>
      val jsonParser = new JsonParser()
      val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
      // getAsString returns the raw value; toString would keep the JSON quotes ("iOS")
      jsonObject.get("account").getAsString == "iOS"})

    val pattern = CEP.pattern(partitionedInput, priceCheck)

    val newStream = pattern.select(x =>
      x.get("start").map({str =>
        str
      })
    )

    newStream.print()

    env.execute()
  }
}

For some reason, newStream.print() in the code above prints nothing. I am positive that Kafka contains data matching the pattern I defined.

Can anyone please help me spot an error in this code?

EDIT:

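import java.time.Instant

import com.google.gson.JsonParser
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class TimestampExtractor extends AssignerWithPeriodicWatermarks[String] with Serializable {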

  override def extractTimestamp(e: String, prevElementTimestamp: Long) = {
    val jsonParser = new JsonParser()
    val context = jsonParser.parse(e).getAsJsonObject.getAsJsonObject("context")
    // getAsString yields the unquoted timestamp, so no quote stripping is needed
    Instant.parse(context.get("serverTimestamp").getAsString).toEpochMilli
  }

  override def getCurrentWatermark(): Watermark = {
    new Watermark(System.currentTimeMillis())
  }
}

I saw in the Flink docs that, if you are using Kafka010, you can just return prevElementTimestamp from the extractTimestamp method and new Watermark(System.currentTimeMillis) from the getCurrentWatermark method.

But I don't understand what prevElementTimestamp is, or why one would return new Watermark(System.currentTimeMillis) as the watermark rather than something else. Can you please elaborate on why we do this and on how Watermark and EventTime work together?

Answer 1:

You set up your job to work in EventTime, but you do not provide a timestamp and watermark extractor.

For more on working in event time, see the Flink docs. If you want to use the timestamps embedded in Kafka records, the Kafka connector docs may help you.

In EventTime, the CEP library buffers events until a watermark arrives so that it can properly handle out-of-order events. In your case no watermarks are generated, so the events are buffered indefinitely.
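One way to supply the missing timestamps and watermarks is sketched below. It assumes the Flink 1.x API used in the question, that the event time lives in context.serverTimestamp as in the asker's own extractor, and that events arrive at most 10 seconds out of order. The names ServerTimestampExtractor and EventTimeWiring and the 10-second bound are illustrative and not taken from the original code.

```scala
import java.time.Instant

import com.google.gson.JsonParser
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Assumption: events are at most 10 seconds out of order.
// The base class emits watermarks that trail the highest timestamp seen by that bound.
class ServerTimestampExtractor
    extends BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(10)) {

  // Reads context.serverTimestamp (ISO-8601) and converts it to epoch milliseconds.
  override def extractTimestamp(json: String): Long = {
    val context = new JsonParser().parse(json).getAsJsonObject.getAsJsonObject("context")
    Instant.parse(context.get("serverTimestamp").getAsString).toEpochMilli
  }
}

object EventTimeWiring {
  // Attach timestamps and watermarks to the raw stream before keyBy and CEP.pattern.
  def withEventTime(input: DataStream[String]): DataStream[String] =
    input.assignTimestampsAndWatermarks(new ServerTimestampExtractor)
}
```

In the question's job this would be applied to dataStream before the keyBy call, so that the CEP operator receives watermarks and can release its buffered events.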

Edit:

For the prevElementTimestamp I think the docs are pretty clear:

There is no need to define a timestamp extractor when using the timestamps from Kafka. The previousElementTimestamp argument of the extractTimestamp() method contains the timestamp carried by the Kafka message.

Since Kafka 0.10.x, Kafka messages can carry an embedded timestamp.
Generating the watermark as new Watermark(System.currentTimeMillis) is not a good idea in this case, since the wall clock says nothing about how far the event timestamps have progressed. You should create the watermark based on your knowledge of the data. As for how Watermark and EventTime work together, I could not explain it more clearly than the docs do.
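As a rough illustration of a watermark derived from the data rather than from the wall clock, the sketch below reuses the Kafka record timestamp passed in as previousElementTimestamp and lets the watermark trail the highest timestamp seen so far. The class name KafkaTimestampAssigner and the maxLatenessMs parameter are hypothetical; the right bound depends on how late your events can actually arrive.

```scala
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// maxLatenessMs is a hypothetical bound chosen from knowledge of the data.
class KafkaTimestampAssigner(maxLatenessMs: Long)
    extends AssignerWithPeriodicWatermarks[String] {

  // Start high enough that the subtraction in getCurrentWatermark cannot underflow.
  private var maxSeenTimestamp: Long = Long.MinValue + maxLatenessMs

  // With the Kafka 0.10 consumer, previousElementTimestamp carries the record's
  // embedded timestamp, so it is returned unchanged as the event time.
  override def extractTimestamp(element: String, previousElementTimestamp: Long): Long = {
    maxSeenTimestamp = math.max(maxSeenTimestamp, previousElementTimestamp)
    previousElementTimestamp
  }

  // The watermark follows the event timestamps, lagging by the allowed lateness.
  override def getCurrentWatermark(): Watermark =
    new Watermark(maxSeenTimestamp - maxLatenessMs)
}
```

It could be attached with dataStream.assignTimestampsAndWatermarks(new KafkaTimestampAssigner(5000L)), where 5000 milliseconds is only a placeholder value.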

Source: https://stackoverflow.com/questions/44965109/nothing-is-being-printed-out-from-a-flink-patterned-stream

Tags

scala

apache-flink