I'm trying to use a WindowFunction with a DataStream. My goal is to run a query like the following:
SELECT *,
coun
As city is a subcategory of country, you can aggregate the stream by the city dimension first, then do another aggregation by the country dimension.
val parsed = stream2.map(x => {
  val arr = x.split(",")
  (arr(0).toInt, arr(1), arr(2))
})
parsed
  // first aggregation: count the elements of each city (third field)
  .keyBy(x => x._3)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
  .process(new ProcessWindowFunction[
    (Int, String, String), (Int, String, String, Int), String, TimeWindow
  ]() {
    override def process(key: String, context: Context,
                         elements: Iterable[(Int, String, String)],
                         out: Collector[(Int, String, String, Int)]): Unit = {
      val lst = elements.toList
      lst.foreach(x => out.collect((x._1, x._2, x._3, lst.size)))
    }
  })
  // second aggregation: re-key by country (second field); a keyed stream needs its own window
  .keyBy(x => x._2)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
  .process(new ProcessWindowFunction[
    (Int, String, String, Int), (Int, String, String, Int), String, TimeWindow
  ]() {
    override def process(key: String, context: Context,
                         elements: Iterable[(Int, String, String, Int)],
                         out: Collector[(Int, String, String, Int)]): Unit = {
      val lst = elements.toList
      // the first stage forwards one record per input element, so the number
      // of records in this window is the count for the country
      val cnt = lst.size
      lst.foreach(x => out.collect((x._1, x._2, x._3, cnt)))
    }
  }).print().setParallelism(1)
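To make the two-step idea concrete, here is a minimal plain-Scala sketch (no Flink involved) of what the two aggregations compute for the contents of a single window. It assumes the tuple layout is (id, country, city), and the names `TwoLevelCount`, `cityCounts` and `countryCounts` are made up for illustration only.

```scala
// Plain-Scala model of one window's contents; not Flink code.
object TwoLevelCount {
  // Assumed layout: (id, country, city)
  type Row = (Int, String, String)

  // Step 1: count rows per (country, city).
  def cityCounts(rows: Seq[Row]): Map[(String, String), Int] =
    rows.groupBy(r => (r._2, r._3)).map { case (k, v) => k -> v.size }

  // Step 2: roll the per-city counts up to the country level,
  // adding each city's count exactly once.
  def countryCounts(perCity: Map[(String, String), Int]): Map[String, Int] =
    perCity
      .groupBy { case ((country, _), _) => country }
      .map { case (country, m) => country -> m.values.sum }
}
```

The second step only ever sees one entry per city, which is why summing the counts there gives the country total without counting anything twice.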
If one dimension is not a sub-dimension of the other, you can concatenate the two dimensions to generate a new key, then implement the aggregation logic yourself in the process function.
keyBy(x=>x._2+x._3)
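One thing to watch with a concatenated key is that different pairs can collapse into the same string. The sketch below is plain Scala with hypothetical helper names (`concatKey`, `separatedKey`, `countByPair`). It shows the collision and two ways around it, assuming the separator character never occurs in your data.

```scala
// Plain-Scala illustration of building a composite key; not Flink code.
object CompositeKey {
  // Plain concatenation: ("ab", "c") and ("a", "bc") both give "abc".
  def concatKey(a: String, b: String): String = a + b

  // Concatenation with a separator that is assumed not to appear in the values.
  def separatedKey(a: String, b: String): String = a + "|" + b

  // Grouping by the pair itself avoids the problem altogether.
  def countByPair(rows: Seq[(Int, String, String)]): Map[(String, String), Int] =
    rows.groupBy(r => (r._2, r._3)).map { case (k, v) => k -> v.size }
}
```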
I think it is not possible to calculate the result in one process function, because you are trying to compute statistics over different keys. The only way to do it in one step is to set the global parallelism to 1 (all input data will then go to one downstream task even if you use a keyBy function) or to broadcast the input data to all downstream tasks.
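For what it's worth, this is roughly what the "one step" version looks like once all the data of a window sits in one place. It is a plain-Scala sketch, not Flink code, and it assumes the layout (id, country, city). `SinglePass` and `annotate` are hypothetical names.

```scala
// Plain-Scala model: both counts computed in one function,
// which only works because every row is visible to it.
object SinglePass {
  // Assumed layout: (id, country, city)
  type Row = (Int, String, String)

  // Returns each row extended with (countryCount, cityCount).
  def annotate(rows: Seq[Row]): Seq[(Int, String, String, Int, Int)] = {
    val byCountry = rows.groupBy(_._2).map { case (k, v) => k -> v.size }
    val byCity = rows.groupBy(r => (r._2, r._3)).map { case (k, v) => k -> v.size }
    rows.map(r => (r._1, r._2, r._3, byCountry(r._2), byCity((r._2, r._3))))
  }
}
```

As soon as the rows are partitioned by one key, the function no longer sees the full group for the other key, which is the reason the two counts normally need two keyed steps.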
Since your calculations actually share some common processing logic, it would be better to abstract it.
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object CountJob {
  @throws[Exception]
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val transactions: DataStream[Record] = env
      .addSource(new SourceFunction[Record] {
        override def run(sourceContext: SourceFunction.SourceContext[Record]): Unit = {
          while (true) {
            sourceContext.collect(Record(1, "a", "b", "c", 1, 1, 1))
            Thread.sleep(1000)
          }
        }

        override def cancel(): Unit = {
        }
      })
      .name("generate source")
    transactions.keyBy(_.addr)
      .timeWindow(Time.seconds(1))
      .process(new CustomCountProc("ADDR"))
      .keyBy(_.city)
      .timeWindow(Time.seconds(1))
      .process(new CustomCountProc("CITY"))
      .keyBy(_.country)
      .timeWindow(Time.seconds(1))
      .process(new CustomCountProc("COUNTRY"))
      .print()
    env.execute("Count Job")
  }
}
// a common operator that handles the different aggregation levels
class CustomCountProc(aggrType: String) extends ProcessWindowFunction[Record, Record, String, TimeWindow] {
  override def process(key: String, context: Context, elements: Iterable[Record], out: Collector[Record]): Unit = {
    for (e <- elements) {
      if ("ADDR".equals(aggrType)) {
        out.collect(Record(-1, e.country, e.city, key, e.country_cnt, e.city_cnt, elements.size))
      }
      else if ("CITY".equals(aggrType)) {
        // keep the original addr here (the addr field must not be filled with e.country)
        out.collect(Record(-1, e.country, key, e.addr, e.country_cnt, elements.size, e.addr_cnt))
      }
      else if ("COUNTRY".equals(aggrType)) {
        out.collect(Record(-1, key, e.city, e.addr, elements.size, e.city_cnt, e.addr_cnt))
      }
    }
  }
}
case class Record(
  id: Int,
  country: String,
  city: String,
  addr: String,
  country_cnt: Int,
  city_cnt: Int,
  addr_cnt: Int
)
By the way, I am not sure the output actually meets your expectation. Since you didn't implement a stateful process function, I think you are trying to calculate the aggregation results for each batch of data, where each batch contains the data ingested within a one-second time window. The output won't accumulate over time; each batch starts from zero.
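To illustrate the difference between per-window counts and an accumulated count, here is a small plain-Scala sketch (no Flink). Each inner sequence stands for the elements of one window. `WindowVsRunning`, `perWindow` and `running` are hypothetical names, and `running` only imitates what a stateful function would keep.

```scala
// Plain-Scala model of windowed vs. accumulated counting; not Flink code.
object WindowVsRunning {
  // Per-window counts: every window starts again from zero.
  def perWindow(windows: Seq[Seq[String]]): Seq[Int] =
    windows.map(_.size)

  // Running total across windows, as a stateful function would maintain it.
  def running(windows: Seq[Seq[String]]): Seq[Int] =
    perWindow(windows).scanLeft(0)(_ + _).tail
}
```

For three windows holding 2, 1 and 3 elements, `perWindow` gives 2, 1, 3 while `running` gives 2, 3, 6.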
When using the timeWindow function, you also need to be aware of the TimeCharacteristic, which by default is processing time.
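Why this matters: the timestamp that is used decides which window an element ends up in. The sketch below is a simplified plain-Scala model of tumbling-window assignment, assuming non-negative timestamps in milliseconds and no window offset. `WindowAssign` and `windowStart` are hypothetical names, not part of Flink.

```scala
// Simplified model of tumbling-window assignment; not Flink code.
object WindowAssign {
  // Start of the tumbling window of the given size that contains ts.
  // Assumes ts >= 0 and no offset.
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)
}
```

An element created at 999 ms but processed at 1001 ms belongs to the window starting at 0 under event time, and to the window starting at 1000 under processing time (with one-second windows).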
The output may also be delayed because three consecutive window functions are used. Suppose the first process function has completed its aggregation within one second and forwarded the results downstream. As the second process function also has a time window of one second, it won't emit any result until it receives the next batch of output from upstream.
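A rough back-of-the-envelope model of that delay, in plain Scala. It assumes that each stage emits exactly at the end of its tumbling window and that the record reaches the next stage at that same instant, which ignores network and scheduling time. `ChainedWindowDelay`, `windowEnd` and `emitTime` are hypothetical names.

```scala
// Simplified latency model for chained tumbling windows; not Flink code.
object ChainedWindowDelay {
  // End of the tumbling window (size in ms) that contains time t.
  def windowEnd(t: Long, size: Long): Long = (t / size + 1) * size

  // Time at which the last of `stages` chained windows emits its result
  // for a record that arrived at `arrival`.
  def emitTime(arrival: Long, size: Long, stages: Int): Long =
    (1 to stages).foldLeft(arrival)((t, _) => windowEnd(t, size))
}
```

Under these assumptions, a record arriving at 1200 ms leaves a single one-second window at 2000 ms, but leaves the third of three chained one-second windows only at 4000 ms.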
Let's see if others have a better solution to your problem.