问题
I have this code below:
import java.util.Properties
import com.google.gson._
import com.typesafe.config.ConfigFactory
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.CEP
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
object WindowedWordCount {
val configFactory = ConfigFactory.load()
def main(args: Array[String]) = {
val brokers = configFactory.getString("kafka.broker")
val topicChannel1 = configFactory.getString("kafka.topic1")
val props = new Properties()
...
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val dataStream = env.addSource(new FlinkKafkaConsumer010[String](topicChannel1, new SimpleStringSchema(), props))
val partitionedInput = dataStream.keyBy(jsonString => {
val jsonParser = new JsonParser()
val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
jsonObject.get("account")
})
val priceCheck = Pattern.begin[String]("start").where({jsonString =>
val jsonParser = new JsonParser()
val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
jsonObject.get("account").toString == "iOS"})
val pattern = CEP.pattern(partitionedInput, priceCheck)
val newStream = pattern.select(x =>
x.get("start").map({str =>
str
})
)
newStream.print()
env.execute()
}
}
For some reason in the above code at the newStream.print()
nothing is being printed out. I am positive that there is data in Kafka that matches the pattern that I defined above but for some reason nothing is being printed out.
Can anyone please help me spot an error in this code?
EDIT:
class TimestampExtractor extends AssignerWithPeriodicWatermarks[String] with Serializable {
override def extractTimestamp(e: String, prevElementTimestamp: Long) = {
val jsonParser = new JsonParser()
val context = jsonParser.parse(e).getAsJsonObject.getAsJsonObject("context")
Instant.parse(context.get("serverTimestamp").toString.replaceAll("\"", "")).toEpochMilli
}
override def getCurrentWatermark(): Watermark = {
new Watermark(System.currentTimeMillis())
}
}
I saw on the flink doc that you can just return prevElementTimestamp
in the extractTimestamp
method (if you are using Kafka010) and new Watermark(System.currentTimeMillis)
in the getCurrentWatermark
method.
But I don't understand what prevElementTimestamp
is or why one would return new Watermark(System.currentTimeMillis)
as the WaterMark and not something else. Can you please elaborate on why we do this on how WaterMark
and EventTime
work together please?
回答1:
You do setup your job to work in EventTime
, but you do not provide timestamp and watermark extractor.
For more on working in event time see those docs. If you want to use the kafka embedded timestamps this docs may help you.
In EventTime
the CEP library buffers events upon watermark arrival so to properly handle out-of-order events. In your case there are no watermarks generated, so the events are buffered infinitly.
Edit:
For the
prevElementTimestamp
I think the docs are pretty clear:There is no need to define a timestamp extractor when using the timestamps from Kafka. The previousElementTimestamp argument of the extractTimestamp() method contains the timestamp carried by the Kafka message.
Since Kafka 0.10.x Kafka messages can have embedded timestamp.
Generating
Watermark
asnew Watermark(System.currentTimeMillis)
in this case is not a good idea. You should createWatermark
based on your knowledge of the data. For information on howWatermark
andEventTime
work together I could not be more clear than the docs
来源:https://stackoverflow.com/questions/44965109/nothing-is-being-printed-out-from-a-flink-patterned-stream