Question
I've got a use case that seems like it should be supported by Kinesis Analytics SQL, but I can't seem to figure it out.
Here is my scenario:
- I have an input stream of data where each event has an event_time field and a device_id field.
- I want to aggregate data by event_time and device_id. Here event_time is a field in the source data; it is neither the ROWTIME at which the row was added to the Kinesis Analytics application nor the approximate arrival time.
- The processes that send data to my stream have some delays, so rows may be added to my stream up to 3 minutes after the event_time has occurred.
My goal is to get a report that summarizes by event_time and device_id that has one row per event_time, and contains all data for that event_time in that one row.
So, my data stream could look like:
rowtime, event_time, device_id, num_things
12:29:04, 12:27:00, server1, 19
12:30:22, 12:28:00, server1, 33
12:30:23, 12:27:00, server2, 8
12:30:25, 12:29:00, server1, 11
12:31:33, 12:28:00, server2, 2
12:31:44, 12:29:00, server3, 83
12:32:56, 12:29:00, server2, 6
The key point here is that the data for a given event_time, such as 12:27, arrives over a period of a few minutes, and a row's event_time can be up to 3 minutes earlier than the time the row is added to the Kinesis Analytics stream.
And I want my output to be:
event_time, total_num_things
12:27, 27 <- sums up 19 + 8 for event_time 12:27
12:28, 35 <- sums up 33+2 for event_time 12:28
12:29, 100 <- sums up 11+83+6 for event_time 12:29
Is this possible?
All the examples I can find use a tumbling window on ROWTIME in the output, so the aggregation for a given event_time could be split across multiple ROWTIME minute buckets.
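To make the problem concrete, here is a minimal Python sketch (plain Python, not Kinesis Analytics SQL) that replays the sample rows above and compares grouping by event_time with bucketing by ROWTIME minute. The names ROWS, minute, and sum_by are illustrative only.

```python
from collections import defaultdict

# Sample rows from the question: (rowtime, event_time, device_id, num_things)
ROWS = [
    ("12:29:04", "12:27:00", "server1", 19),
    ("12:30:22", "12:28:00", "server1", 33),
    ("12:30:23", "12:27:00", "server2", 8),
    ("12:30:25", "12:29:00", "server1", 11),
    ("12:31:33", "12:28:00", "server2", 2),
    ("12:31:44", "12:29:00", "server3", 83),
    ("12:32:56", "12:29:00", "server2", 6),
]


def minute(ts):
    """Truncate an HH:MM:SS string to HH:MM."""
    return ts[:5]


def sum_by(rows, key):
    """Sum num_things per group; `key` maps a row to its grouping value."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Desired result: one total per event_time.
by_event_time = sum_by(ROWS, lambda r: minute(r[1]))

# A one-minute tumbling window on ROWTIME effectively groups by
# (rowtime minute, event_time), so each event_time ends up in several buckets.
by_rowtime_bucket = sum_by(ROWS, lambda r: (minute(r[0]), minute(r[1])))

print(by_event_time)
print(len(by_rowtime_bucket), "partial rows instead of", len(by_event_time))
```

With this sample, every row lands in its own (ROWTIME minute, event_time) bucket, which is exactly the fragmentation described above.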
Answer 1:
LAG is now available ... perhaps it helps.
http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-lag.html
Answer 2:
For those who have not moved to a new technology ;-). A sliding window is less appropriate here, since we are not constraining events over a time interval; rather, we always want to group by time and then sum. The events are just not available immediately.
So the semantics are closer to a working session, where the sessionId is the point in time.
This can be expressed in Drools:
Types:
package com.test;

import java.util.List;

declare EventA
    @role(event)
    eventTime: long;
    deviceId: int;
    numThings: int;
    seen: boolean;
end

declare Group
    eventTime: long @key;
    events: List;
end

declare Summary
    eventTime: long;
    sumNumThings: int;
end
Rules:
package com.test;

import java.util.List;
import java.util.ArrayList;
import java.util.stream.Collectors;

rule "GroupCreate"
when
    // for every new EventA
    EventA(seen == false, $time: eventTime) from entry-point events
    // check there is no group
    not (exists(Group(eventTime == $time)))
then
    insert(new Group($time, new ArrayList()));
end

rule "GroupJoin"
when
    // for every new EventA
    $a : EventA(seen == false) from entry-point events
    // get event's group
    $g: Group(eventTime == $a.eventTime)
then
    $g.getEvents().add($a);
    modify($a) {setSeen(true);}
end

rule "Summarize"
// if session timed out, clean up first
salience 5
when
    // for every EventA
    $a : EventA() from entry-point events
    // check there is no more events within 30 seconds
    not (exists(EventA(this != $a, eventTime == $a.eventTime,
        this after[0, 30s] $a) from entry-point events))
    // get event's group
    $g: Group(eventTime == $a.eventTime)
then
    int sum = (int)$g.getEvents().stream().collect(
        Collectors.summingInt(EventA::getNumThings));
    insertLogical(new Summary($g.getEventTime(), sum));
    // cleanup
    for (Object $x : $g.getEvents())
        delete($x);
    delete($g);
end
You can author Drools Kinesis Analytics with this service
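For readers who do not use Drools, the session idea above can be sketched in plain Python. This is a rough emulation under stated assumptions, not a translation of the rules: a group for an event_time stays open while rows for it keep arriving, and it is summarized once it has been quiet for longer than a configurable gap. The function names and the gap_seconds parameter are hypothetical.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def summarize_sessions(arrivals, gap_seconds):
    """Sum values per event_time, closing a group after a quiet period.

    arrivals: (arrival 'HH:MM:SS', event_time, num_things), in arrival order.
    Returns (event_time, total) pairs in the order the groups were closed.
    """
    open_groups = {}  # event_time -> [last_arrival_seconds, running_sum]
    summaries = []
    for arrival, event_time, num_things in arrivals:
        now = to_seconds(arrival)
        # Close every group that has been quiet for longer than the gap.
        stale = [k for k, (last, _) in open_groups.items()
                 if now - last > gap_seconds]
        for key in stale:
            summaries.append((key, open_groups.pop(key)[1]))
        group = open_groups.setdefault(event_time, [now, 0])
        group[0] = now
        group[1] += num_things
    # End of input: flush whatever is still open.
    for key, (_, total) in open_groups.items():
        summaries.append((key, total))
    return summaries


# Sample rows from the question: (rowtime, event_time, num_things)
ARRIVALS = [
    ("12:29:04", "12:27", 19), ("12:30:22", "12:28", 33),
    ("12:30:23", "12:27", 8), ("12:30:25", "12:29", 11),
    ("12:31:33", "12:28", 2), ("12:31:44", "12:29", 83),
    ("12:32:56", "12:29", 6),
]

print(summarize_sessions(ARRIVALS, gap_seconds=180))
print(len(summarize_sessions(ARRIVALS, gap_seconds=30)))
```

In this emulation, a 30-second gap summarizes each sample row separately, because rows for the same event_time arrive more than a minute apart. The gap has to be sized to the maximum delay, which is 3 minutes in the question.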
Answer 3:
It seems that "Stagger Windows" are what you are looking for.
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/stagger-window-concepts.html
Using stagger windows is a windowing method that is suited for analyzing groups of data that arrive at inconsistent times. It is well suited for any time-series analytics use case, such as a set of related sales or log records.
For example, VPC Flow Logs have a capture window of approximately 10 minutes. But they can have a capture window of up to 15 minutes if you're aggregating data on the client. Stagger windows are ideal for aggregating these logs for analysis.
Stagger windows address the issue of related records not falling into the same time-restricted window, such as when tumbling windows were used.
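The behavior described above can be emulated in a few lines of Python. This is only a sketch of my understanding of stagger windows, not Kinesis Analytics SQL: a window for a partition key opens when the first matching row arrives and closes once it reaches a fixed age measured from that opening. The names stagger_sum and range_seconds are hypothetical, and emission timing in the real service may differ.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def stagger_sum(arrivals, range_seconds):
    """Emulate a stagger window partitioned by key.

    arrivals: (arrival 'HH:MM:SS', key, value), in arrival order.
    A window opens on the first row for a key and closes when its age,
    measured from that first row, reaches range_seconds.
    """
    open_windows = {}  # key -> [opened_at_seconds, running_sum]
    results = []
    for arrival, key, value in arrivals:
        now = to_seconds(arrival)
        # Emit every window that has reached its maximum age.
        expired = [k for k, (opened, _) in open_windows.items()
                   if now - opened >= range_seconds]
        for k in expired:
            results.append((k, open_windows.pop(k)[1]))
        window = open_windows.setdefault(key, [now, 0])
        window[1] += value
    # End of input: emit the windows that are still open.
    for k, (_, total) in open_windows.items():
        results.append((k, total))
    return results


# Sample rows from the question: (rowtime, event_time, num_things)
ARRIVALS = [
    ("12:29:04", "12:27", 19), ("12:30:22", "12:28", 33),
    ("12:30:23", "12:27", 8), ("12:30:25", "12:29", 11),
    ("12:31:33", "12:28", 2), ("12:31:44", "12:29", 83),
    ("12:32:56", "12:29", 6),
]

# Partition by event_time with a 3-minute range, matching the maximum delay.
print(stagger_sum(ARRIVALS, range_seconds=180))
```

Because each window is anchored to the first arrival for its event_time rather than to a fixed ROWTIME boundary, all rows for 12:27, 12:28, and 12:29 in the sample fall into a single window per event_time.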
Source: https://stackoverflow.com/questions/44442606/analyze-a-tumbling-window-with-a-lag-in-aws-kinesis-analytics-sql