Background
A question came up in a Flink study group recently: they want to monitor user session behavior in real time. If a user clicks event A in the current session and does not click event B within 1 hour, the real-time stream should output event C.
Take an e-commerce page as an example.
Flink concepts involved
1. Flink state: since we aggregate per session, we need keyBy plus a process function.
2. State management is handled inside Flink's KeyedProcessFunction.
3. The onTimer callback of KeyedProcessFunction acts as the timer that checks the condition when it fires.
Note:
The TimerService internally maintains two kinds of timers (processing-time and event-time timers) and queues them for execution. The TimerService deduplicates timers per key and timestamp, i.e. each key has at most one timer per timestamp. If multiple timers are registered for the same timestamp, the onTimer() method is invoked only once.
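The deduplication rule above can be modeled outside Flink with a per-key sorted set of timestamps. This is only an illustrative sketch of the semantics (the names TimerModel, registerTimer, and advanceWatermark are mine, not Flink API), not Flink's actual internals:

```scala
import scala.collection.mutable

// Illustrative model of TimerService deduplication: pending timers are kept
// per key in a set, so a (key, timestamp) pair is stored at most once and
// registering the same timestamp twice still fires only one timer.
object TimerModel {
  private val pending = mutable.Map.empty[String, mutable.SortedSet[Long]]

  def registerTimer(key: String, ts: Long): Unit =
    pending.getOrElseUpdate(key, mutable.SortedSet.empty[Long]) += ts

  // Fire (and remove) all timers with timestamp <= watermark; returns the fired timestamps.
  def advanceWatermark(key: String, watermark: Long): Seq[Long] = {
    val timers = pending.getOrElse(key, mutable.SortedSet.empty[Long])
    val fired = timers.filter(_ <= watermark).toSeq
    timers --= fired
    fired
  }
}
```

Registering the same timestamp twice for one key collapses into a single pending timer, which mirrors the "at most one timer per key and timestamp" rule.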
Enough talk, here is the code.
Kafka producer code:
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import scala.io.Source

object kafkaProduct {
  def test1() = {
    val brokers_list = "localhost:9092"
    val topic = "flink2"
    val props = new Properties()
    props.put("group.id", "test-flink")
    props.put("metadata.broker.list", brokers_list)
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("num.partitions", "4")
    val config = new ProducerConfig(props)
    val producer = new Producer[String, String](config)
    // read the sample events line by line and send each one to Kafka under a random key
    for (line <- Source.fromFile("/Users/huzechen/Downloads/flinktest/src/main/resources/cep1").getLines) {
      val key = scala.util.Random.nextInt(3).toString
      println(key)
      producer.send(new KeyedMessage(topic, key, line))
    }
    producer.close()
  }

  def main(args: Array[String]): Unit = {
    test1()
  }
}
Kafka test data (written by hand):
{"session_id":"0000015","event_id":"A"}
{"session_id":"0000016","event_id":"A"}
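If you don't want to write the test file by hand, lines in the same shape can be generated. This sketch (the object and method names are mine) just builds JSON strings with the field names used above, zero-padding the session id to seven digits:

```scala
// Build sample event lines matching the test-data format above.
object TestData {
  def eventLine(sessionId: Int, eventId: String): String =
    f"""{"session_id":"$sessionId%07d","event_id":"$eventId"}"""

  // n lines with consecutive session ids and the same event id
  def lines(n: Int, eventId: String): Seq[String] =
    (1 to n).map(eventLine(_, eventId))
}
```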
Flink code: since many newcomers asked for more comments in the code, this version is commented in detail.
import java.text.SimpleDateFormat
import java.util.{Date, Properties}

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{StateTtlConfig, ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.util.Collector

object SessionIdKeyedProcessFunction {

  class MyTimeTimestampsAndWatermarks extends AssignerWithPunctuatedWatermarks[(String, String)] with Serializable {
    // use the current processing time as the event timestamp
    override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long = {
      System.currentTimeMillis()
    }

    // emit a watermark 1 second behind the extracted timestamp
    override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
      new Watermark(extractedTimestamp - 1000)
    }
  }

  case class SessionInfo(session_id: String, event_id: String, timestamp: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val properties = new Properties()
    // Kafka broker address; the old 0.8 consumer also needs the ZooKeeper address
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("zookeeper.connect", "localhost:2181")
    properties.setProperty("group.id", "test-flink")
    val topic = "flink2"

    // create the Kafka source for the real-time stream
    val consumer = new FlinkKafkaConsumer08(topic, new SimpleStringSchema(), properties)
    val text: DataStream[(String, String)] = env.addSource(consumer).map(line => {
      val json = JSON.parseObject(line)
      // return the user's session_id and the event_id of the user action
      (json.get("session_id").toString, json.get("event_id").toString)
    }).assignTimestampsAndWatermarks(new MyTimeTimestampsAndWatermarks())

    text.keyBy(0)
      .process(new SessionIdTimeoutFunction()).setParallelism(1).print()
    env.execute()
  }

  // since the stream is keyed, state is created per key (key = session_id);
  // the timeout check is implemented via KeyedProcessFunction's onTimer method
  class SessionIdTimeoutFunction extends KeyedProcessFunction[Tuple, (String, String), (String, String)] {
    private var state: ValueState[SessionInfo] = _
    private val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    override def open(parameters: Configuration): Unit = {
      super.open(parameters)
      // expire the state 5 minutes after it was created or last written
      val config = StateTtlConfig.newBuilder(Time.minutes(5))
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .build()
      val valueStateDescriptor = new ValueStateDescriptor("myState1", classOf[SessionInfo])
      valueStateDescriptor.enableTimeToLive(config)
      state = getRuntimeContext.getState(valueStateDescriptor)
    }

    override def processElement(message: (String, String),
                                ctx: KeyedProcessFunction[Tuple, (String, String), (String, String)]#Context,
                                out: Collector[(String, String)]): Unit = {
      // first event seen for this session_id
      if (state.value() == null) {
        val timeStamp = ctx.timestamp()
        // pass the current event through; event order is not enforced here.
        // Enforcing A-before-B would require a redesigned state layout;
        // this version only demonstrates the principle, an order-aware
        // variant will follow in a later post
        out.collect(message)
        // if the event is A, register an event-time timer to fire 5 seconds later
        if (message._2 == "A") {
          ctx.timerService.registerEventTimeTimer(timeStamp + 5000)
          state.update(SessionInfo(message._1, message._2, timeStamp))
        }
      }
      println("current time: " + sdf.format(new Date(ctx.timestamp())))
      // if a B event arrives for this session_id, record B in the state
      if (message._2 == "B") {
        state.update(SessionInfo(message._1, message._2, ctx.timestamp()))
      }
    }

    override def onTimer(timestamp: Long,
                         ctx: KeyedProcessFunction[Tuple, (String, String), (String, String)]#OnTimerContext,
                         out: Collector[(String, String)]): Unit = {
      // if this key has not seen a B event 5 seconds after A,
      // and this timer is the one registered for that A, emit a C event
      val info = state.value()
      if (info != null) { // guard: the state may already have expired via TTL
        println("onTimer fired, recorded time_fire time: " + sdf.format(new Date(info.timestamp)) + "_" + sdf.format(new Date(timestamp)))
        if (info.event_id != "B" && info.timestamp + 5000 == timestamp) {
          out.collect(("session_id: " + info.session_id, "no B within 5s, emitting C"))
        }
      }
    }
  }
}
Result verification: as you can see, when a session_id receives an A event and no B event arrives within 5 s, that session_id triggers a C event.
current time: 2019-08-29 13:11:41
current time: 2019-08-29 13:11:41
2> (0000010,A)
1> (000009,A)
current time: 2019-08-29 13:12:01
current time: 2019-08-29 13:12:01
4> (0000012,A)
3> (0000011,A)
current time: 2019-08-29 13:12:13
current time: 2019-08-29 13:12:13
5> (0000013,A)
6> (0000014,A)
current time: 2019-08-29 13:12:24
current time: 2019-08-29 13:12:24
onTimer fired, recorded time_fire time: 2019-08-29 13:11:41_2019-08-29 13:11:46
onTimer fired, recorded time_fire time: 2019-08-29 13:11:41_2019-08-29 13:11:46
onTimer fired, recorded time_fire time: 2019-08-29 13:12:01_2019-08-29 13:12:06
onTimer fired, recorded time_fire time: 2019-08-29 13:12:01_2019-08-29 13:12:06
3> (session_id: 0000011,no B within 5s, emitting C)
4> (session_id: 0000012,no B within 5s, emitting C)
8> (0000015,A)
7> (0000016,A)
2> (session_id: 0000010,no B within 5s, emitting C)
1> (session_id: 000009,no B within 5s, emitting C)
Source: CSDN
Author: 小晨说数据
Link: https://blog.csdn.net/huzechen/article/details/100138205