Background
A question came up in a Flink study group recently: they want to monitor user session behavior in real time — if, within a given session, the user clicks event A and then does not click event B within 1 hour, the stream should output event C in real time. (The demo below shortens the window to 5 seconds to make testing easier.)
Take an e-commerce page as an example.
Relevant Flink concepts
1. Flink state: since we aggregate per session, we need keyBy plus a process function.
2. State management is implemented inside Flink's KeyedProcessFunction.
3. The onTimer timer callback of KeyedProcessFunction performs the timed check in real time.
Note:
The TimerService internally maintains two kinds of timers (processing-time and event-time timers) and queues them for execution. TimerService deduplicates timers per key and timestamp, i.e. each key has at most one timer per timestamp. If multiple timers are registered for the same timestamp, the onTimer() method is invoked only once.
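This per-key, per-timestamp deduplication can be sketched with a small plain-Scala simulation (no Flink dependency; `SimTimerService` and its methods are hypothetical names, not Flink API): registering the same (key, timestamp) pair several times still produces a single callback.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Flink's TimerService: keeps at most one
// timer per (key, timestamp), mirroring the deduplication described above.
class SimTimerService {
  private val timers = mutable.Set[(String, Long)]()

  def registerTimer(key: String, ts: Long): Unit = timers += ((key, ts))

  // advance the simulated watermark and fire every due timer exactly once
  def advanceTo(watermark: Long)(onTimer: (String, Long) => Unit): Unit = {
    val due = timers.filter(_._2 <= watermark).toSeq.sortBy(_._2)
    due.foreach { case (k, t) => onTimer(k, t); timers -= ((k, t)) }
  }
}

val svc = new SimTimerService
var calls = List.empty[(String, Long)]
// register the same timer for the same key and timestamp three times
svc.registerTimer("0000015", 5000L)
svc.registerTimer("0000015", 5000L)
svc.registerTimer("0000015", 5000L)
svc.advanceTo(10000L)((k, t) => calls = calls :+ ((k, t)))
println(calls) // fires once: List((0000015,5000))
```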
Enough talk — straight to the code.
Kafka producer code:
```scala
import java.util.Properties

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

import scala.io.Source

object kafkaProduct {
  def test1() = {
    val brokers_list = "localhost:9092"
    val topic = "flink2"
    val props = new Properties()
    props.put("group.id", "test-flink")
    props.put("metadata.broker.list", brokers_list)
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("num.partitions", "4")
    val config = new ProducerConfig(props)
    val producer = new Producer[String, String](config)
    for (line <- Source.fromFile("/Users/huzechen/Downloads/flinktest/src/main/resources/cep1").getLines) {
      // random key in [0, 3) so records spread across partitions
      val aa = scala.util.Random.nextInt(3).toString
      println(aa)
      producer.send(new KeyedMessage(topic, aa, line))
    }
    producer.close()
  }

  def main(args: Array[String]): Unit = {
    test1()
  }
}
```
Kafka test data (written in manually to simulate traffic):
```
{"session_id":"0000015","event_id":"A"}
{"session_id":"0000016","event_id":"A"}
```
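As a quick sanity check, each test line can be parsed into the (session_id, event_id) tuple the job expects. The sketch below uses only a Scala standard-library regex, purely for illustration — the actual job uses com.alibaba.fastjson:

```scala
// Illustrative parser for the test records above, using only the Scala
// standard library (the real job parses with com.alibaba.fastjson instead).
val Record = """\{"session_id":"([^"]+)","event_id":"([^"]+)"\}""".r

def parse(line: String): (String, String) = line match {
  case Record(sid, eid) => (sid, eid)
}

println(parse("""{"session_id":"0000015","event_id":"A"}"""))
// (0000015,A)
```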
Flink code: since many newcomers have asked me to add more comments to the code, today I'll oblige and write a well-commented version; if anything is still unclear, you can add me on WeChat: weixin605405145
```scala
import java.text.SimpleDateFormat
import java.util.{Date, Properties}

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{StateTtlConfig, ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.util.Collector

object SessionIdKeyedProcessFunction {

  class MyTimeTimestampsAndWatermarks extends AssignerWithPunctuatedWatermarks[(String, String)] with Serializable {
    // assign a timestamp to each element
    override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long = {
      System.currentTimeMillis()
    }

    // emit a watermark after each element
    override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
      new Watermark(extractedTimestamp - 1000)
    }
  }

  case class SessionInfo(session_id: String, event_id: String, timestamp: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val properties = new Properties()
    // Kafka broker address; the old consumer also needs the ZooKeeper address
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("zookeeper.connect", "localhost:2181")
    val topic = "flink2"
    properties.setProperty("group.id", "test-flink")

    // initialize the Kafka source stream
    val consumer = new FlinkKafkaConsumer08(topic, new SimpleStringSchema(), properties)
    val text: DataStream[Tuple2[String, String]] = env.addSource(consumer)
      .map(line => {
        val json = JSON.parseObject(line)
        // return the user's session_id and event_id
        Tuple2(json.get("session_id").toString, json.get("event_id").toString)
      })
      .assignTimestampsAndWatermarks(new MyTimeTimestampsAndWatermarks())

    text.keyBy(0)
      .process(new SessionIdTimeoutFunction())
      .setParallelism(1)
      .print()

    env.execute()

    // State is created per key (key = session_id) since we aggregate by key.
    // Implements the onTimer method of KeyedProcessFunction.
    class SessionIdTimeoutFunction extends KeyedProcessFunction[Tuple, (String, String), (String, String)] {
      private var state: ValueState[SessionInfo] = _
      private var sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

      override def open(parameters: Configuration): Unit = {
        super.open(parameters)
        val config = StateTtlConfig
          .newBuilder(Time.minutes(5))
          .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
          .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
          .build()
        val valueStateDescriptor = new ValueStateDescriptor("myState1", classOf[SessionInfo])
        valueStateDescriptor.enableTimeToLive(config)
        state = getRuntimeContext.getState(valueStateDescriptor)
      }

      override def processElement(message: (String, String),
                                  ctx: KeyedProcessFunction[Tuple, (String, String), (String, String)]#Context,
                                  out: Collector[(String, String)]) = {
        // first event seen for this session_id
        if (state.value() == null) {
          val timeStamp = ctx.timestamp()
          // Emit the current event. Event ordering is not handled here;
          // enforcing ordering would require a redesigned state — this
          // version keeps it simple to show the principle, and I'll write
          // an order-aware version later.
          out.collect(message)
          // if the event is A, register a timer to fire 5 seconds later
          if (message._2 == "A") {
            ctx.timerService.registerEventTimeTimer(timeStamp + 5000)
            state.update(SessionInfo(message._1, message._2, timeStamp))
          }
        }
        println("Current time: " + sdf.format(new Date(ctx.timestamp)))
        // if a B event arrives for this session_id, record it in the state
        if (message._2 == "B") {
          state.update(SessionInfo(message._1, message._2, ctx.timestamp()))
        }
      }

      override def onTimer(timestamp: Long,
                           ctx: KeyedProcessFunction[Tuple, (String, String), (String, String)]#OnTimerContext,
                           out: Collector[(String, String)]): Unit = {
        // If no B event arrived for this key within 5 seconds, and event
        // time has actually reached the registered trigger point, emit C.
        println("onTimer fired, state time_fire time: " + sdf.format(new Date(state.value().timestamp)) + "_" + sdf.format(new Date(timestamp)))
        if (state.value().event_id != "B" && state.value().timestamp + 5000 == timestamp) {
          out.collect(("SessionID: " + state.value().session_id, "no B within 5s, emitting event C"))
        }
      }
    }
  }
}
```
Validation results: as you can see, when a session_id receives an A event and no B event arrives within 5 s, each such session_id triggers a C event.
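The core pattern — emit C when an A event is not followed by a B event within the window — can also be checked without a cluster, using a small event-time simulation. Everything here (`Event`, `detectTimeouts`) is an illustrative re-implementation of the KeyedProcessFunction logic above, not Flink API:

```scala
case class Event(sessionId: String, eventId: String, ts: Long)

// For each session, emit a C record if an A event is not followed by a
// B event within `windowMs` (pure-Scala sketch of the timeout logic).
def detectTimeouts(events: Seq[Event], windowMs: Long): Seq[(String, Long)] = {
  events.groupBy(_.sessionId).toSeq.flatMap { case (sid, evs) =>
    val sorted = evs.sortBy(_.ts)
    sorted.collect {
      case Event(_, "A", aTs)
        if !sorted.exists(e => e.eventId == "B" && e.ts > aTs && e.ts <= aTs + windowMs) =>
        (sid, aTs + windowMs) // C fires at A's timestamp + window
    }
  }
}

val events = Seq(
  Event("s1", "A", 0L),                             // no B at all      -> C at 5000
  Event("s2", "A", 0L), Event("s2", "B", 3000L),    // B in time       -> no C
  Event("s3", "A", 1000L), Event("s3", "B", 9000L)  // B too late      -> C at 6000
)
println(detectTimeouts(events, 5000L).sortBy(_._1))
// List((s1,5000), (s3,6000))
```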
```
Current time: 2019-08-29 13:11:41
Current time: 2019-08-29 13:11:41
2> (0000010,A)
1> (000009,A)
Current time: 2019-08-29 13:12:01
Current time: 2019-08-29 13:12:01
4> (0000012,A)
3> (0000011,A)
Current time: 2019-08-29 13:12:13
Current time: 2019-08-29 13:12:13
5> (0000013,A)
6> (0000014,A)
Current time: 2019-08-29 13:12:24
Current time: 2019-08-29 13:12:24
onTimer fired, state time_fire time: 2019-08-29 13:11:41_2019-08-29 13:11:46
onTimer fired, state time_fire time: 2019-08-29 13:11:41_2019-08-29 13:11:46
onTimer fired, state time_fire time: 2019-08-29 13:12:01_2019-08-29 13:12:06
onTimer fired, state time_fire time: 2019-08-29 13:12:01_2019-08-29 13:12:06
3> (SessionID: 0000011,no B within 5s, emitting event C)
4> (SessionID: 0000012,no B within 5s, emitting event C)
8> (0000015,A)
7> (0000016,A)
2> (SessionID: 0000010,no B within 5s, emitting event C)
1> (SessionID: 000009,no B within 5s, emitting event C)
```
If you like Flink, add me on WeChat: weixin605405145 and I'll invite you to a Flink + Spark discussion group.
Source: CSDN
Author: 小晨说数据
Link: https://blog.csdn.net/huzechen/article/details/100138205