Apache Flink: ProcessWindowFunction KeyBy() multiple values

庸人自扰 2020-12-20 10:24

I'm trying to use WindowFunction with DataStream; my goal is to have a query like the following

SELECT  *,
    coun         


        
1 answer
  • 2020-12-20 11:06

    Since city is a subcategory of country, you can aggregate the stream by the city dimension first, then aggregate again by the country dimension.

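    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector
    
    // stream2 is assumed to be a DataStream[String] of "id,country,city" lines
    val parsed = stream2.map(x => {
      val arr = x.split(",")
      (arr(0).toInt, arr(1), arr(2))
    })
    
    parsed
      .keyBy(x => x._3) // first stage: key by city
      .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
      .process(new ProcessWindowFunction[
        (Int, String, String), (Int, String, String, Int), String, TimeWindow
      ]() {
        override def process(key: String, context: Context,
                             elements: Iterable[(Int, String, String)],
                             out: Collector[(Int, String, String, Int)]): Unit = {
          // append the per-city count to every record of the window
          val lst = elements.toList
          lst.foreach(x => out.collect((x._1, x._2, x._3, lst.size)))
        }
      })
      .keyBy(x => x._2) // second stage: key by country
      .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
      .process(new ProcessWindowFunction[
        (Int, String, String, Int), (Int, String, String, Int), String, TimeWindow
      ]() {
        override def process(key: String, context: Context,
                             elements: Iterable[(Int, String, String, Int)],
                             out: Collector[(Int, String, String, Int)]): Unit = {
          val lst = elements.toList
          // every record carries its city's count, so add each city's count only once
          val cnt = lst.groupBy(_._3).values.map(_.head._4).sum
          lst.foreach(x => out.collect((x._1, x._2, x._3, cnt)))
        }
      }).print().setParallelism(1)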
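
    To make the intended roll-up concrete, here is a minimal sketch of the same two stages on a plain Scala collection, with no Flink involved. The object name `TwoStageCount`, the helper names and the sample rows are made up for illustration; the only assumption carried over from the snippet above is the `(id, country, city)` tuple shape.

    ```scala
    object TwoStageCount {
      // (id, country, city): same shape as the tuples in the snippet above
      type Row = (Int, String, String)

      // Stage 1: attach the per-city count to every row.
      def byCity(rows: List[Row]): List[(Int, String, String, Int)] =
        rows.groupBy(_._3).values.toList.flatMap { g =>
          g.map(r => (r._1, r._2, r._3, g.size))
        }

      // Stage 2: per country, add up each city's count exactly once.
      def byCountry(rows: List[(Int, String, String, Int)]): Map[String, Int] =
        rows.groupBy(_._2).map { case (country, g) =>
          country -> g.groupBy(_._3).values.map(_.head._4).sum
        }

      def main(args: Array[String]): Unit = {
        // hypothetical sample data
        val rows = List((1, "US", "NYC"), (2, "US", "NYC"), (3, "US", "LA"), (4, "FR", "Paris"))
        println(byCountry(byCity(rows)))
      }
    }
    ```

    Summing the count carried by every row instead would count a city with n rows n times over, which is why the second stage looks at one row per city.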
    

    If neither dimension is a sub-dimension of the other, you can concatenate the two dimensions into a new key and implement the aggregation logic yourself in the process function.

    keyBy(x=>x._2+x._3)
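
    One caveat with plain concatenation is that two different pairs can produce the same key. The sketch below shows the collision and one possible way around it; the object name, the helper names and the `|` separator are arbitrary choices for illustration, not something from the original answer.

    ```scala
    object CompositeKey {
      // Plain concatenation, as in the one-liner above.
      def concatKey(a: String, b: String): String = a + b

      // Concatenation with a separator. "|" is an arbitrary choice and
      // assumes it never occurs inside the values themselves.
      def separatedKey(a: String, b: String): String = a + "|" + b

      def main(args: Array[String]): Unit = {
        println(concatKey("ab", "c") == concatKey("a", "bc"))       // true: both are "abc"
        println(separatedKey("ab", "c") == separatedKey("a", "bc")) // false
      }
    }
    ```

    Keying by the tuple itself, as in `keyBy(x => (x._2, x._3))`, should avoid the problem altogether; the key type of the process function then becomes `(String, String)`.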
    

    UPDATE

    I don't think it is possible to calculate the result in a single process function, because you are trying to compute statistics over different keys. The only way to do it in one step is to set the global parallelism to 1 (all input data then goes to one downstream task even if you use a keyBy function) or to broadcast the input data to all downstream tasks.
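
    The reason can be illustrated with a rough simulation of keyed partitioning in plain Scala. This is only a sketch: `Rec`, `partition` and `countryCounts` are hypothetical names, and routing by `hashCode` modulo the parallelism is a simplification of what Flink really does (it assigns keys to key groups).

    ```scala
    object PartitionSketch {
      case class Rec(country: String, city: String, addr: String)

      // Simplified stand-in for keyBy: route each record by the hash of its key.
      def partition(recs: List[Rec], parallelism: Int)(key: Rec => String): Map[Int, List[Rec]] =
        recs.groupBy(r => math.abs(key(r).hashCode % parallelism))

      // Country counts as seen by each task, looking only at its own records.
      def countryCounts(parts: Map[Int, List[Rec]]): Map[Int, Map[String, Int]] =
        parts.map { case (task, rs) =>
          task -> rs.groupBy(_.country).map { case (c, g) => c -> g.size }
        }

      def main(args: Array[String]): Unit = {
        // hypothetical sample data: three records of one country, two addrs
        val recs = List(Rec("US", "NYC", "a"), Rec("US", "NYC", "b"), Rec("US", "LA", "a"))
        println(countryCounts(partition(recs, 2)(_.addr))) // each task holds a partial US count
        println(countryCounts(partition(recs, 1)(_.addr))) // one task sees all of US
      }
    }
    ```

    With the stream keyed by addr and a parallelism of 2, no single task sees every record of the country, so a per-country count computed there is only partial.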

    Since your calculations actually share some common processing logic, it would be better to abstract it.

    import org.apache.flink.streaming.api.functions.source.SourceFunction
    import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector
    
    object CountJob {
    
      @throws[Exception]
      def main(args: Array[String]): Unit = {
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    
        val transactions: DataStream[Record] = env
          .addSource(new SourceFunction[Record] {
            // flipped by cancel() so that the emit loop can stop
            @volatile private var running = true
    
            override def run(sourceContext: SourceFunction.SourceContext[Record]): Unit = {
              while (running) {
                sourceContext.collect(Record(1, "a", "b", "c", 1, 1, 1))
                Thread.sleep(1000)
              }
            }
    
            override def cancel(): Unit = {
              running = false
            }
          })
          .name("generate source")
    
        transactions.keyBy(_.addr)
          .timeWindow(Time.seconds(1))
          .process(new CustomCountProc("ADDR"))
          .keyBy(_.city)
          .timeWindow(Time.seconds(1))
          .process(new CustomCountProc("CITY"))
          .keyBy(_.country)
          .timeWindow(Time.seconds(1))
          .process(new CustomCountProc("COUNTRY"))
          .print()
    
    
        env.execute("Count Job")
      }
    }
    
    // a common operator shared by the three aggregation stages
    class CustomCountProc(aggrType: String) extends ProcessWindowFunction[Record, Record, String, TimeWindow] {
    
      override def process(key: String, context: Context, elements: Iterable[Record], out: Collector[Record]): Unit = {
        // number of records for this key in the current window
        val cnt = elements.size
    
        for (e <- elements) {
          aggrType match {
            case "ADDR" =>
              out.collect(Record(-1, e.country, e.city, key, e.country_cnt, e.city_cnt, cnt))
            case "CITY" =>
              // addr is carried through unchanged; only city_cnt is filled in here
              out.collect(Record(-1, e.country, key, e.addr, e.country_cnt, cnt, e.addr_cnt))
            case "COUNTRY" =>
              out.collect(Record(-1, key, e.city, e.addr, cnt, e.city_cnt, e.addr_cnt))
            case _ => // unknown aggregation type: emit nothing
          }
        }
    
      }
    }
    
    case class Record(
                       id: Int,
                       country: String,
                       city: String,
                       addr: String,
                       country_cnt: Int,
                       city_cnt: Int,
                       addr_cnt: Int
                     ) {
    }
    

    Btw, I am not sure whether the output actually meets your expectation. Since you didn't implement a stateful process function, I think you are trying to calculate the aggregation results for each batch of data, where each batch contains the data ingested within a one-second time window. The output won't accumulate over time; each batch starts from zero.
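
    The difference between the two behaviours can be sketched on plain Scala collections. Everything here is illustrative: `WindowVsRunning` and its helpers are hypothetical names, and each inner `Seq` simply stands for the keys that arrived in one window. In Flink, the accumulating variant would typically be built on keyed state (for example a `KeyedProcessFunction` holding a `ValueState`), which this sketch does not attempt to reproduce.

    ```scala
    object WindowVsRunning {
      // Each Batch holds the keys that arrived in one window.
      type Batch = Seq[String]

      private def count(b: Batch): Map[String, Int] =
        b.groupBy(identity).map { case (k, v) => k -> v.size }

      // Windowed behaviour: every batch is counted independently.
      def perWindow(batches: Seq[Batch]): Seq[Map[String, Int]] =
        batches.map(count)

      // Stateful behaviour: counts carry over from one batch to the next.
      def running(batches: Seq[Batch]): Seq[Map[String, Int]] =
        batches.scanLeft(Map.empty[String, Int]) { (acc, b) =>
          b.foldLeft(acc)((m, k) => m.updated(k, m.getOrElse(k, 0) + 1))
        }.tail

      def main(args: Array[String]): Unit = {
        // hypothetical input: two consecutive windows
        val batches = Seq(Seq("US", "US", "FR"), Seq("US"))
        println(perWindow(batches)) // second window reports US -> 1
        println(running(batches))   // second window reports US -> 3, FR -> 1
      }
    }
    ```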

    When using the timeWindow function, you also need to pay attention to the TimeCharacteristic, which is processing time by default.

    The output may also be delayed because three consecutive window functions are used. Suppose the first process function completes its aggregation within one second and forwards the results downstream. Since the second process function also has a time window of 1 second, it won't emit any result until it receives the next batch of output from upstream.
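
    A back-of-the-envelope model of that delay, in plain Scala, is sketched below. It rests on two simplifying assumptions that are not from the original answer: windows are aligned at time 0, and each stage emits exactly at the end of its window, with that instant counting as the arrival time at the next stage. `ChainedWindowDelay`, `windowEnd` and `emitTime` are hypothetical names.

    ```scala
    object ChainedWindowDelay {
      // End of the tumbling window that contains timestamp t (both in ms, t >= 0).
      def windowEnd(t: Long, size: Long): Long = (t / size + 1) * size

      // Pass an arrival time through `stages` chained windows of equal size.
      def emitTime(arrival: Long, size: Long, stages: Int): Long =
        (1 to stages).foldLeft(arrival)((t, _) => windowEnd(t, size))

      def main(args: Array[String]): Unit = {
        println(emitTime(250L, 1000L, 1)) // 1000
        println(emitTime(250L, 1000L, 3)) // 3000
      }
    }
    ```

    Under these assumptions a record that arrives 250 ms into the first window only leaves the third stage at the 3-second mark, so the end-to-end delay grows with the number of chained windows.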

    Let's see if others have a better solution to your problem.
